IMPORTANT NOTICE:

Please read README.md before running the code. This repository uses Git Large File Storage (LFS) to track large data files; if LFS is unavailable, use the link in the README to download the data and replace the LFS pointer files.

1. Introduction

This report investigates potential biases in stop-and-search practices by UK police officers, focusing on the impact of the individual’s sex, ethnicity, age group, and police force area.


2. Data

1) API Utilization:

Scraped stop-and-search data from the UK Police API (data.police.uk).

Ethnicity Classification: Implementing a “combined_ethnicity” variable that blends officer-defined and self-defined ethnicities. This approach addresses the missing data (20% in 2023) in self-reported ethnicity.
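The blending rule can be sketched in base R (illustrative function and values; the full dplyr mapping appears in the appendix code):

```r
# Prefer the officer-defined ethnicity; fall back to a coarse grouping of the
# self-defined label (illustrative subset of the full mapping).
combine_ethnicity <- function(officer_defined, self_defined) {
  ifelse(!is.na(officer_defined), officer_defined,
         ifelse(grepl("Black|African|Caribbean", self_defined), "Black",
                ifelse(grepl("Asian", self_defined), "Asian", self_defined)))
}

combine_ethnicity(c("White", NA, NA),
                  c("White British", "Black African", "Asian Indian"))
# → "White" "Black" "Asian"
```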

2) Census Data Integration:

• Downloaded the population of England and Wales from the 2011 and 2021 Censuses, broken down by Police Force Area and self-defined ethnic group

As this dataset only includes self-defined ethnicity, it is used only for the force-level analysis.

• Used a CSS selector to scrape 2021 Census population figures by ethnicity and age group
• Downloaded the 2021 Census population by age, ethnic group and sex from the Office for National Statistics
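The CSS-selector scraping step can be sketched with rvest (toy HTML standing in for the census page; the real page structure and selector differ):

```r
library(rvest)

# Toy HTML table standing in for the published population table; on the real
# page the selector would target the table holding the census figures.
page <- minimal_html('
  <table class="pop">
    <tr><td>18-24</td><td>5,000</td></tr>
  </table>')

page |> html_elements("table.pop td") |> html_text()
# → "18-24" "5,000"
```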

Defining a “Reasonable Percentage”: Creating a metric to evaluate whether searches correlate with legitimate policing concerns or potential biases, based on the proportion of searches whose outcome is linked to the object of search.
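As a minimal sketch (the column name is an assumption), the metric is the share of searches whose outcome is linked to the object of search:

```r
# Reasonable search percentage: proportion of outcome-linked searches,
# ignoring records where the link is unrecorded (NA).
reasonable_pct <- function(outcome_linked) {
  mean(outcome_linked, na.rm = TRUE) * 100
}

reasonable_pct(c(TRUE, FALSE, TRUE, TRUE, NA))
# → 75
```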

Data discrepancies, such as the binary gender classification in the 2021 Census, were noted for their potential impact on the gender-based analysis.

3) Data Limitations:

Data are unavailable from certain police forces:

## [1] "Greater Manchester Police: Due to a change in IT systems no crime, outcome or stop and search data is available from July 2019 onwards. The force are working to rectify this issue and provide the missing data over the coming months."
## [1] "Gwent Police: The force are currently rectifying the issues with the Stop and Search Data and are looking to provide the data once the issues are rectified."
## [1] "The Police Service of Northern Ireland does not currently provide stop and search data."

3. Analysis

The analysis used SQL-driven data manipulation, focusing on search frequency, search rate per 1,000 population, and the proportion of outcome-linked searches across demographic variables, presented by year and in total.

Assumption: searches may be reasonable given police experience and local crime rates, so we introduce the reasonable search percentage to assess this. If a relatively low proportion of search outcomes is linked to the object of search while a relatively high number of searches is conducted, a potential bias may exist.
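The SQL-driven grouping can be sketched with DBI and an in-memory SQLite database (toy table and column names; the report's actual tables are created in the appendix code):

```r
library(DBI)

# Toy table standing in for the stop-and-search records.
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "searches", data.frame(
  gender         = c("Male", "Male", "Male", "Female"),
  outcome_linked = c(0, 1, 0, 1)
))

# Search counts and outcome-linked percentage per group.
res <- dbGetQuery(db, "
  SELECT gender,
         COUNT(*) AS n_searches,
         ROUND(AVG(outcome_linked) * 100, 1) AS outcome_linked_pct
  FROM searches
  GROUP BY gender
")
dbDisconnect(db)
res
```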

3.1 Gender-Based Analysis

Pie chart shows gender distribution in police searches.

Bar charts compare search frequency, searches per 1,000 population, and outcome-linked percentages by gender.

Line graphs compare trends in search frequency and searches per 1,000 population by gender from 2021 to 2023.

A significant gender disparity exists in searches, with males constituting 89% of cases. Notably, while males are searched far more often, females show a higher rate of outcome-linked searches. This discrepancy raises questions about gender biases in search practices.

3.2 Ethnicity-Based Analysis

Pie chart shows ethnicity distribution in police searches. White individuals account for the largest share of searches (62.65%), followed by Black (19.54%) and Asian (12.44%) individuals.

Bar charts reveal search frequency, searches per 1,000 population, and outcome-linked percentages by ethnicity.

Line graphs show trends in search frequency and searches per 1,000 population by ethnicity from 2021 to 2023.

The data reveals a pronounced ethnic bias, particularly against Black individuals, who face search rates over seven times higher than White individuals. The outcome-linked rate for Black individuals is significantly lower, suggesting searches are less frequently justified. Trends show some reduction in this bias, but the disparity remains stark.

3.3 Age-Based Analysis

Pie chart shows age distribution in police searches.

Bar charts compare search frequency, searches per 1,000 population, and outcome-linked percentages by age group.

Line graphs compare trends in search frequency and searches per 1,000 population by age group from 2021 to 2023.

The 18-24 age group is the most frequently searched, accounting for 31.73% of all searches, though its share declined from 2021 to 2023. The consistency in outcome-linked search proportions across age groups suggests a less pronounced age bias compared to gender or ethnicity. Note that the over-34 group spans a much wider age range, so its results are less directly comparable.

3.4 Combined Ethnicity and Age Analysis

Group bar charts show average search frequency and rate by age within each ethnic group from 2021 to 2023.

The combined view highlights acute biases, particularly against Black individuals aged 18-24. The data show this group faces significantly higher search rates, underscoring the compounded effect of age and ethnicity in policing biases.

3.5 Combined Age and Sex Analysis

Bar chart compares the average search proportion from 2021 to 2023 with the population proportion for males and females across age groups.

Males consistently experience higher search rates than females in all age groups. This disparity is particularly evident in the over 25 age group, despite a larger female population.

3.6 Force-Based Analysis

## Reading layer `Layer #0' from data source 
##   `/Users/shengyuwen/Downloads/MY472_final_assignment-main/force kmls/avon-and-somerset.kml' 
##   using driver `KML'
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XYZ
## Bounding box:  xmin: -3.839859 ymin: 50.82059 xmax: -2.244435 ymax: 51.68234
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84

## [1] "Unmatched Names in force_df_sum:"
## character(0)
## [1] "Unmatched Names in all_kml_data:"
## [1] "greater-manchester" "gwent"              "north-yorkshire"   
## [4] "northern-ireland"
## [1] "Number of rows with 'MULTIPOLYGON Z EMPTY' geometry: 0"
## Warning: st_centroid assumes attributes are constant over geometries
## st_as_s2(): dropping Z and/or M coordinate
## Scale on map varies by more than 10%, scale bar may be inaccurate


Maps show search rates and average search frequency from 2021 to 2023 by police force area.

The City of London exhibits the highest search rates, followed by Merseyside and the Metropolitan area. Notably, the Metropolitan area leads in the total number of searches.

3.7 Daily and Hourly Search Frequency

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `datetime = ymd_hms(datetime)`.
## Caused by warning:
## !  25890 failed to parse.

Bar charts depict search frequency throughout the day and across different days of the week.

Searches concentrate between 4 PM and 1 AM and decline through the morning, indicating a focus on evening and nighttime policing. Searches are least frequent on Sundays and peak on Fridays. This uneven distribution across times and days could skew results, potentially overrepresenting behaviours or demographic groups that are more active at those times.
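The hourly and weekly breakdown relies on parsing the datetime field, e.g. with lubridate (the failed-to-parse warning above reflects malformed timestamps being dropped):

```r
library(lubridate)

dt <- ymd_hms("2022-06-10 23:15:00")  # illustrative timestamp

hour(dt)                  # hour of day, 0-23 → 23
wday(dt, week_start = 1)  # day of week, 1 = Monday → 5 (Friday)
```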

Appendix: All code in this assignment

# this chunk contains code that sets global options for the entire .Rmd. 
# we use include=FALSE to suppress it from the top of the document, but it will still appear in the appendix. 

knitr::opts_chunk$set(echo = FALSE) # actually set the global chunk options. 
# we set echo=FALSE to suppress code such that it by default does not appear throughout the document. 
# note: this is different from .Rmd default
library(httr)
library(jsonlite)
library(tidyverse)
library(jpeg) #to let us read .jpegs/.jpgs
library(grid) #to let us plot images
library(rvest)
library(dplyr)
library(DBI)
library(ggplot2)
options(repos = c(CRAN = "https://cran.rstudio.com/"))
if (!requireNamespace("scales", quietly = TRUE)) install.packages("scales")  # install only if missing
library(scales)
library(tmap)
library(classInt)
library(sf)
if (!requireNamespace("ggspatial", quietly = TRUE)) install.packages("ggspatial")  # install only if missing
library(ggspatial)
library(gridExtra)
# use the API to scrape data; available date range: 2020-12 until 2023-11, exactly 36 months (3 years) of data
crimes_street_dates_url <- "https://data.police.uk/api/crimes-street-dates"

#fromJSON(crimes_street_dates_url) #run this function to test available data range ,which is 2020-12 until 2023-11


# use the API to scrape all the police forces
forces_url <- "https://data.police.uk/api/forces"

forces<- fromJSON(forces_url)
# print(forces$id)

stop_and_searches_by_force_url <-"https://data.police.uk/api/stops-force?force=avon-and-somerset&date=2021-11"
# fromJSON(stop_and_searches_by_force_url)


# write a function that uses the API to scrape monthly data for one force
get_force_data_monthly <- function(year, month, force) {
  formatted_month <- formatC(month, width = 2, format = "d", flag = "0")
  url <- paste0("https://data.police.uk/api/stops-force?force=", force, "&date=", year, "-", formatted_month)
  response <- GET(url)
  if (status_code(response) == 200) {
    data <- fromJSON(rawToChar(response$content), flatten = TRUE)
    # check whether it returns data
    if (is.data.frame(data) && nrow(data) > 0) {
      # add police force id and year to the result dataframe
      data$force <- force
      data$year <- year
      return(data)
    }
  }
  # return empty data frame if no data or request failed
  return(data.frame(year = integer(0), force = character(0), stringsAsFactors = FALSE))
}

# build an empty list to store data
all_data <- list()
# loop through all months from 2021 to 2023 (2021 census data is used to eliminate scale influence)
years <- 2021:2023
for (force in forces$id) {
  for (year in years) {
    for (month in 1:12) {
      message("Fetching data for ", force, " in ", year, "-", month)
      monthly_data <- get_force_data_monthly(year, month, force)
      all_data[[length(all_data) + 1]] <- monthly_data
    }
  }
}

# combine all data frame into one and fill NA for empty cell
all_data_df <- bind_rows(all_data)  # this is the original data

#print(head(all_data_df))
# store data at local
write.csv(all_data_df, "all_data_df.csv", row.names = FALSE)
#import research data
stop_and_searches_2021_2023<- read.csv("all_data_df.csv")
#head(stop_and_searches_2021_2023)
# Rearrange columns
stop_and_searches_2021_2023<- stop_and_searches_2021_2023 %>%
  select(year, force, age_range, gender, self_defined_ethnicity, officer_defined_ethnicity, 
         outcome_linked_to_object_of_search, datetime, everything())
# head(stop_and_searches_2021_2023)

# edit the data format to make it coherent across all datasets used
stop_and_searches_2021_2023<- stop_and_searches_2021_2023 %>%
  mutate(
    combined_ethnicity = ifelse(
      !is.na(officer_defined_ethnicity),
      officer_defined_ethnicity,
      case_when(
        str_detect(self_defined_ethnicity, "Black|African|Caribbean") ~ "Black",
        str_detect(self_defined_ethnicity, "Asian") ~ "Asian or Asian British",
        str_detect(self_defined_ethnicity, "Mixed|Multiple") ~ "Mixed",
        # Add more patterns as necessary
        TRUE ~ word(self_defined_ethnicity, 1)  # Fallback to the first word
      )
    ),
    combined_ethnicity = case_when(
      combined_ethnicity == "Asian" ~ "Asian or Asian British",
      combined_ethnicity == "Black" ~ "Black or Black British",
      combined_ethnicity == "Other" ~ "Other Ethnic Group",
      TRUE ~ combined_ethnicity
    )
  ) %>%
  relocate(
    combined_ethnicity,
    .after = officer_defined_ethnicity
  )


#install.packages(c("DBI", "RSQLite"))
library(DBI)

# Create database: this will create a file on disk if it does not exist already.
dir.create("database", showWarnings = FALSE)  # SQLite cannot create the parent directory itself
db <- dbConnect(RSQLite::SQLite(), "database/assignment4.sqlite")
db_exists <- file.exists("database/assignment4.sqlite")
# db_exists


# write the called API data into the database after manipulation
dbWriteTable(db, "stop_and_searches_2021_2023", stop_and_searches_2021_2023, row.names = FALSE, overwrite = TRUE)

stop_search_mar21_mar23 <- read.csv("stop-search-open-data-tables-mar21-mar23.csv")

dbWriteTable(db, "stop_search_mar21_mar23", stop_search_mar21_mar23, row.names = FALSE, overwrite = TRUE)
tables <- dbListTables(db)
print(tables)
dbListFields(db,"stop_search_mar21_mar23")

# check how many forces' stop and search data are included
unique_values_in_force <- unique(stop_and_searches_2021_2023$force)
#unique_values_in_force
library(dplyr)

# import 2021 census data, population by force and ethnicity 
population_by_force_by_ethnicity_21census<- read.csv("population-by-force-by-ethnicity.csv")
population_by_force_by_ethnicity_21census <- population_by_force_by_ethnicity_21census %>%
  mutate(
    combinedethnicity = case_when(
       grepl("Mixed", Self.defined.ethnicity) ~ "Mixed",  # Collapse any category containing "Mixed" into "Mixed"
      TRUE ~ Self.defined.ethnicity
    ),
    Police.Force.Area = case_when(
      Police.Force.Area == "Avon and Somerset" ~ "avon-and-somerset",
      Police.Force.Area == "Devon and Cornwall" ~ "devon-and-cornwall",
      Police.Force.Area == "London, City of" ~ "city-of-london",
      Police.Force.Area == "Dyfed-Powys" ~ "dyfed-powys",
      Police.Force.Area == "North Wales" ~ "north-wales",
      Police.Force.Area == "South Wales" ~ "south-wales",
      Police.Force.Area == "Thames Valley" ~ "thames-valley",
      Police.Force.Area == "South Yorkshire" ~ "south-yorkshire",
      Police.Force.Area == "North Yorkshire" ~ "north-yorkshire",
      Police.Force.Area == "West Mercia" ~ "west-mercia",
      Police.Force.Area == "West Midlands" ~ "west-midlands",
      Police.Force.Area == "West Yorkshire" ~ "west-yorkshire",
       Police.Force.Area == "Greater Manchester" ~ "greater-manchester",
       Police.Force.Area == "Metropolitan Police" ~ "Metropolitan",
      
      TRUE ~ Police.Force.Area
    )
  )
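Most of the renames above are simply "lower-case and hyphenate", so a small helper could replace the bulk of the `case_when()`; a sketch, noting that irregular rows such as "London, City of" and "Metropolitan Police" would still need explicit handling:

```r
# Convert e.g. "Avon and Somerset" to "avon-and-somerset"
slugify_force <- function(x) {
  gsub(" ", "-", tolower(trimws(x)))
}

slugify_force("Avon and Somerset")  # "avon-and-somerset"
```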

population_by_force_by_ethnicity_21census <- population_by_force_by_ethnicity_21census %>%
  filter(Census.Year != 2011)

population_by_force_by_ethnicity_21census$Police.Force.Area <- tolower(population_by_force_by_ethnicity_21census$Police.Force.Area)

population_by_force_by_ethnicity_21census$Population <- as.numeric(gsub(",", "", population_by_force_by_ethnicity_21census$Population))
#head(population_by_force_by_ethnicity_21census)


dbWriteTable(db, "population_by_force_by_ethnicity_21census", population_by_force_by_ethnicity_21census, row.names = FALSE, overwrite = TRUE)

tables <- dbListTables(db)

# print(tables)


# dbListFields(db,"population_by_force_by_ethnicity_21census")
library(rvest)
library(xml2)

# URL of the webpage to scrape
url <- "https://www.ethnicity-facts-figures.service.gov.uk/uk-population-by-ethnicity/demographics/age-groups/latest/"

# Read the HTML content from the website
webpage <- read_html(url)

# Extract the table
table <- html_node(webpage, "#main-content > div:nth-child(12) > div > div:nth-child(1) > div > div > div > table")

# Convert the table into a dataframe
df <- html_table(table, fill = TRUE)

# Setting the second row as header
colnames(df) <- as.character(unlist(df[2, ]))

# Removing the first two rows
df <- df[-c(1, 2), ]

# View the modified dataframe
# head(df)
df <- rename(df, Age = `Age brackets`)


# Gathering all columns except the first one into key-value pairs
age_ethnicity <- gather(df, key = "key", value = "value", -1)

# Creating separate columns for 'ethnicity' and 'type' (percentage or count)
age_ethnicity<- age_ethnicity %>%
  separate(key, into = c("ethnicity", "type"), sep = " ") %>%
  spread(type, value)

# Renaming columns for clarity
colnames(age_ethnicity) <- c("age_group", "ethnicity", "Percentage", "Population")

age_ethnicity$Population <- as.numeric(gsub(",", "", age_ethnicity$Population))
# View the transformed dataframe
# head(age_ethnicity)


# Function to map age groups
map_age_group <- function(age_group) {
  if (age_group %in% c("Age 0 to 4", "Age 5 to 9")) {
    return("under 10")
  } else if (age_group %in% c("Age 10 to 14", "Age 15 to 17")) {
    return("10-17")
  } else if (age_group == "Age 18 to 24") {
    return("18-24")
  } else if (age_group %in% c("Age 25 to 29", "Age 30 to 34")) {
    return("25-34")
  } else {
    return("over 34")
  }
}
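As a design note, the same mapping could be written vectorised with `dplyr::case_when()`, which removes the need for `sapply()`; a sketch assuming the same census bracket labels as above:

```r
library(dplyr)

# Vectorised alternative to map_age_group(); intended to match the
# if/else version for the bracket labels used above
map_age_group_vec <- function(age_group) {
  case_when(
    age_group %in% c("Age 0 to 4", "Age 5 to 9")     ~ "under 10",
    age_group %in% c("Age 10 to 14", "Age 15 to 17") ~ "10-17",
    age_group == "Age 18 to 24"                      ~ "18-24",
    age_group %in% c("Age 25 to 29", "Age 30 to 34") ~ "25-34",
    TRUE                                             ~ "over 34"
  )
}
```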

# Apply the function to the dataframe
age_ethnicity$age_group <- sapply(age_ethnicity$age_group, map_age_group)


age_ethnicity<- age_ethnicity %>%
  mutate(
    ethnicity = case_when(
      ethnicity == "Asian" ~ "Asian or Asian British",
      ethnicity == "Black" ~ "Black or Black British",
      ethnicity == "Other" ~ "Other Ethnic Group",
      TRUE ~ ethnicity  # Keeps other values unchanged
    )
  )

age_ethnicity <- age_ethnicity %>%
  filter(ethnicity != "All")
# View the modified dataframe
# head(age_ethnicity)
dbWriteTable(db, "age_ethnicity", age_ethnicity , row.names = FALSE, overwrite = TRUE)

library(dplyr)
library(RSelenium)
library(rvest)
library(netstat)
library(stringr)
# Start a Selenium driver (Firefox) on a randomly chosen free port
driver <- rsDriver(browser = "firefox", port = free_port(random = TRUE), chromever = NULL)
remDr <- driver[["client"]]
url <- "https://data.police.uk/changelog/"
url2 <- "https://data.police.uk/data/"
remDr$navigate(url)

html <- read_html(url)
greater_manchester <- html %>% html_elements(css = "#content > div:nth-child(2) > ul:nth-child(3) > li:nth-child(5)")
greater_manchester_text <- greater_manchester %>% html_text() %>% trimws()

gwent <- html %>% html_elements(css = "#content > div:nth-child(2) > ul:nth-child(3) > li:nth-child(6)")
gwent_text <- gwent %>% html_text() %>% trimws()

html <- read_html(url2)
north_ireland <- html %>% html_elements(css = "#downloads > p:nth-child(2)")
north_ireland_text <- north_ireland %>% html_text()

# Close the Selenium session now that scraping is finished
remDr$close()
driver$server$stop()

writeLines(greater_manchester_text, "greater_manchester_text.txt")
writeLines(gwent_text, "gwent_text.txt")
writeLines(north_ireland_text, "north_ireland_text.txt")
greater_manchester_text <- readLines("greater_manchester_text.txt")
gwent_text <- readLines("gwent_text.txt")
north_ireland_text <- readLines("north_ireland_text.txt")

greater_manchester_text
gwent_text
north_ireland_text
ethnicity_age_sex_census<- read.csv("age-sex-ethnicity.csv")

# Selecting specific columns and renaming them
ethnicity_age_sex_census <- ethnicity_age_sex_census %>%
  select(
    age = `Age..86.categories..Code`, 
    ethnicity_group = `Ethnic.group..6.categories.`,
    sex = `Sex..2.categories.`,
    observation = Observation
  )

# View the modified dataframe
# head(ethnicity_age_sex_census)

age_groups <- c("under 10", "10-17", "18-24", "25-34", "over 34")
ethnicity_age_sex_census <- ethnicity_age_sex_census %>%
  mutate(age_group = cut(age, 
                         breaks = c(-Inf, 10, 18, 25, 35, Inf),  # right = FALSE, so "25-34" covers ages 25-34 inclusive
                         labels = age_groups,
                         right = FALSE))

ethnicity_age_sex_census <- ethnicity_age_sex_census %>%
  mutate(
    ethnicity_group = case_when(
      ethnicity_group == "Asian, Asian British or Asian Welsh" ~ "Asian or Asian British",
      ethnicity_group == "Black, Black British, Black Welsh, Caribbean or African" ~ "Black or Black British",
      ethnicity_group == "Mixed or Multiple ethnic groups" ~ "Mixed",
      ethnicity_group == "Other ethnic group" ~ "Other Ethnic Group",
      TRUE ~ ethnicity_group  # Keeps other values unchanged
    )
  )

ethnicity_age_sex_census <- select(ethnicity_age_sex_census, -age)
# head(ethnicity_age_sex_census)


dbWriteTable(db, "ethnicity_age_sex_census", ethnicity_age_sex_census , row.names = FALSE, overwrite = TRUE)
# A brief summary of the 2021 Census population data by the research variables (sex, ethnicity, age group, and police force area), used to normalise raw search counts for population size.
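As a minimal sketch of the normalisation these census summaries enable, the per-1,000 rate below mirrors the SQL expression used later, `(COUNT(*) * 1000.0) / (3 * SUM(Population))`; the numbers are invented for illustration, not real figures:

```r
# Hypothetical example values, for illustration only
searches_recorded <- 150000    # searches recorded for a group over the 3-year window
group_population  <- 2500000   # 2021 Census population of that group

# Searches per 1,000 population per year
rate_per_1000_per_year <- (searches_recorded * 1000) / (3 * group_population)
rate_per_1000_per_year  # 20
```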

age_population_df <- dbGetQuery(db,"
SELECT 
    age_group,
    SUM(REPLACE(Population, ',', '') + 0) AS TotalPopulation
    FROM age_ethnicity
GROUP BY age_group")


ethnicity_population_df <- dbGetQuery(db, "
  SELECT 
     combinedethnicity,
     SUM(REPLACE(Population, ',', '') + 0) AS TotalPopulation
  FROM population_by_force_by_ethnicity_21census
  GROUP BY combinedethnicity")


force_population_df <- dbGetQuery(db, "
  SELECT 
      `Police.Force.Area`,
     SUM(REPLACE(Population, ',', '') + 0) AS TotalPopulation
  FROM population_by_force_by_ethnicity_21census
  GROUP BY `Police.Force.Area`")


sex_population_df <- dbGetQuery(db, "
 SELECT 
    sex,
    SUM(REPLACE(observation, ',', '') + 0) AS TotalPopulation
    FROM ethnicity_age_sex_census
GROUP BY sex")
age_population_df
ethnicity_population_df
force_population_df
sex_population_df

# Identify the distinct values in the gender column of the original database
unique_values_in_gender <- unique(stop_and_searches_2021_2023$gender)
# unique_values_in_gender
 
sex_sum <- dbGetQuery(db, "
 SELECT 
   gender,
   COUNT(*) AS number_of_searches,
   COUNT(*) / 3.0 AS average_number_of_searches_per_year,
   
   (COUNT(*) * 1000.0) /  
   (3* (SELECT SUM(REPLACE(psex.observation, ',', '') + 0) FROM ethnicity_age_sex_census psex  
   WHERE psex.sex = ss.gender) )AS searches_per_1000_population_per_year,
   
   (COUNT(*) * 100.0) / 
   (SELECT COUNT(*) FROM stop_and_searches_2021_2023 
   WHERE gender IS NOT NULL) AS search_percentage,

   (SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0) / 
   NULLIF(COUNT(*), 0) AS reasonable_search_percentage
   
FROM stop_and_searches_2021_2023  ss
WHERE gender IS NOT NULL
GROUP BY gender
")
# sex_sum

sex_df_year <- dbGetQuery(db, "
 SELECT
 year,
 gender,
 COUNT(*) AS number_of_searches,
 
 (COUNT(*) * 1000.0) / 
 ((SELECT SUM(REPLACE(psex.observation, ',', '') + 0) FROM ethnicity_age_sex_census psex  
 WHERE psex.sex = ss.gender) )AS searches_per_1000_population_per_year,

 (COUNT(*) * 100.0) / 
 (SELECT COUNT(*)
 FROM stop_and_searches_2021_2023
 WHERE year = ss.year AND gender IS NOT NULL) AS search_percentage_per_year,

 
 (SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0) / NULLIF(COUNT(*), 0) AS reasonable_search_percentage
 

FROM stop_and_searches_2021_2023 ss
WHERE gender IS NOT NULL
GROUP BY year, gender
")
# sex_df_year
library(ggplot2)
options(repos = c(CRAN = "https://cran.rstudio.com/"))
# install.packages("scales")
library(scales)
# Create the pie chart
pie_chart <- ggplot(sex_sum, aes(x = "", y = search_percentage, fill = gender)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_void() +
  scale_fill_brewer(palette = "Set1", name = "Gender", labels = paste(sex_sum$gender, ": ", round(sex_sum$search_percentage, 2), "%")) +
  theme(legend.position = "bottom",legend.title = element_blank(),
        plot.caption = element_text(hjust = 0.5)) +
  labs(subtitle = "Percentage of Stop and Searches by Sex",
       caption = "Note:\n• Does not include vehicle only searches.\n• Does not include other sexes.\n• Does not include search records where sex is unknown.") +
  theme(legend.margin = margin(1, 1, 1, 1, unit = "pt"))

# Add text labels for all categories except "Other"
pie_chart <- pie_chart +
  geom_text(data = subset(sex_sum, gender != "Other"),  # Exclude "Other" here
            aes(label = paste0(round(search_percentage, 2), "%")), 
            position = position_stack(vjust = 0.5), 
            color = "black", size = 3)

# Then, overlay the text label for "Other" closer to the pie
pie_chart <- pie_chart +
  geom_text(data = subset(sex_sum, gender == "Other"),
            aes(label = paste0(round(search_percentage, 2), "%"), 
                x = 1.2, y = 0),  # Adjust position of "Other" label
            color = "black", size = 3)+
  theme(plot.caption = element_text(hjust = 0), # Left-align the text within the caption
    plot.caption.position = "plot",
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"))
  

# Display the pie chart
# pie_chart


# Filter out the "Other" category for displaying 
filtered_sex_sum <- sex_sum %>% 
  filter(gender != "Other") %>% 
  droplevels()  # Important: This drops unused factor levels


# Define colors for each gender
base_colors <- c("Male" = "#1f77b4", 
                 "Female" = "#d62728")

# Create a named vector for alpha values to be used for the legend
alpha_values <- c("Linked to search object" = 1, "No Link" = 0.45)

# Create the bar chart
bar_chart <- ggplot(filtered_sex_sum, aes(x = gender, y = number_of_searches, fill = gender)) +
 geom_bar(stat = "identity", aes(alpha = "No Link")) +
  geom_bar(stat = "identity", 
           aes(y = number_of_searches * reasonable_search_percentage / 100, alpha = "Linked to search object")) +
  geom_text(aes(y = number_of_searches * reasonable_search_percentage / 100, 
                label = paste0(round(reasonable_search_percentage, 2), "%")), 
            vjust = 1.3, size=2.8,
            color = "black") +
  geom_text(aes(label = number_of_searches), 
            position = position_stack(vjust = 1.1), 
            color = "black") +
  scale_y_continuous(labels = label_number())+
  scale_fill_manual(values = base_colors) +
  scale_alpha_manual(values = alpha_values, 
                     guide = guide_legend(title = "Outcome")) +
  theme_minimal() +
  labs( 
    #theme(plot.title = element_text(size = 8)),
    #caption = "Note: \n• Does not include vehicle only searches.\n• Does not include other sexes. \n• Does not include searches record where sex is unkown",
    subtitle = "Total Number of Stop and Searches by Sex ",
        #subtitle = "in financial year with outcome-linked searches percentage",
       x = "Gender", y = "Total Number of Searches", fill = "Gender") +
  theme(axis.text.x = element_text( size = 8),
        plot.subtitle = element_text(hjust = 0.9,vjust=2))+
 theme(legend.text = element_text(size = 8),  # Adjust text size
        legend.key.size = unit(0.5, "cm"))+
  guides(fill = guide_legend(nrow = 1, title = "Gender"),
         alpha = guide_legend(nrow = 1, title = "Outcome")) +
  theme(legend.position = "bottom",
        legend.box = "vertical",
        ) 

#bar_chart

library(grid)  # textGrob() and gpar() come from the grid package
grid.arrange(pie_chart, bar_chart, ncol = 2, widths = c(0.95, 1),
             top = textGrob("Search Against Gender from 2021 to 2023 with Outcome-linked Percentage",
                            gp = gpar(fontface = "bold", fontsize = 14)))

library(ggplot2)
library(scales)  


# Filter out the "Other" category
filtered_sex_df_year <- sex_df_year %>% 
  filter(gender != "Other") %>% 
  droplevels()  # Important: This drops unused factor levels

# Convert 'year' to a factor to ensure proper dodging
filtered_sex_df_year$year <- factor(filtered_sex_df_year$year)

# Plotting the bar chart
ggplot(filtered_sex_df_year, aes(x = gender, y = searches_per_1000_population_per_year, fill = year)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(
    aes(label = paste0(round(searches_per_1000_population_per_year, 2)), group = year),
    vjust = -0.3,
    position = position_dodge(width = 0.9),
    color = "black",
    size = 3
  ) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    caption = "Note: \n• Does not include vehicle only searches.\n• Does not include other sexes.\n• Does not include search records where sex is unknown.",
    y = "Number of Searches per 1000 Population", x = "Gender",
    title = "Search Rate: Number of Searches per 1000 Population by Gender",
    subtitle = "from Year 2021 to 2023 using 2021 Census Data"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5),# Center-align the title
    plot.subtitle = element_text(hjust = 0.5),# Center-align the subtitle
    plot.caption = element_text(hjust = 0), # Left-align the text within the caption
    plot.caption.position = "plot",
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")
  )
    

library(ggplot2)
library(scales)  # For formatting the y-axis labels


# Calculate the non-reasonable searches
filtered_sex_df_year$non_reasonable_searches <- filtered_sex_df_year$number_of_searches - (filtered_sex_df_year$number_of_searches * filtered_sex_df_year$reasonable_search_percentage / 100)

# Define alpha values for reasonable and non-reasonable searches
alpha_values <- c("Link to search object" = 1, "No link" = 0.5)  # Adjust non-reasonable alpha here

# Plotting the bar chart
ggplot(filtered_sex_df_year, aes(x = gender, fill = year)) +
  # Reasonable searches
  geom_bar(aes(y = number_of_searches * reasonable_search_percentage / 100, alpha = "Link to search object"), stat = "identity", position = "dodge") +
  # Non-reasonable searches (adjusted alpha value for better visibility)
  geom_bar(aes(y = non_reasonable_searches, alpha = "No link"), stat = "identity", position = "dodge") +
  # Labels for total number of searches
  geom_text(
    aes(y = number_of_searches, label = number_of_searches), 
    vjust = -0.3, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 3
  ) +
  # Labels for percentage of reasonable searches
  geom_text(
    aes(y = number_of_searches * reasonable_search_percentage / 100, label = paste0(round(reasonable_search_percentage, 2), "%")), 
    vjust = -0.5, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 2.5
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_alpha_manual(values = alpha_values, guide = guide_legend(title = "Search Outcome")) +
  labs(caption = "Note: \n• Does not include vehicle only searches.\n• Does not include other sexes.\n• Does not include search records where sex is unknown.", x = "Gender Group", y = "Number of Searches", title = "Number of Stop and Searches by Gender from Year 2021 to 2023") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.caption = element_text(hjust = 0),
    plot.caption.position = "plot",
      plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"))



library(ggplot2)
library(scales)
library(tidyr)
library(dplyr)

sex_df_year$year <- as.numeric(as.character(sex_df_year$year))
# Reshape the data to long format
long_sex_df_year <- sex_df_year %>%
  pivot_longer(
    cols = c(searches_per_1000_population_per_year, number_of_searches),
    names_to = "metric",
    values_to = "value"
  )
long_sex_df_year <- long_sex_df_year %>%
  filter(!is.na(year) & !is.na(value) & !is.na(gender))

# Plotting the line graph
line_chart <- ggplot(long_sex_df_year, aes(x = year, y = value, color = gender, group = gender)) +
  geom_line() +
  geom_point() + # Add points to the line
  geom_text(
    aes(label = paste0(round(value, 2))),
    vjust = -0.5,  # Adjust this value to position the text above the points
    color = "black", 
    size = 2.1,
    check_overlap = TRUE  # Avoid overlapping text
  ) +
  scale_x_continuous(breaks = unique(sex_df_year$year)) + # Set breaks at unique years
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    caption = "Note: \n• Does not include vehicle only searches. \n• Does not include search records where sex is unknown.",
    x = "Year", y = "Value", 
    title = "Search Rate and Total Number of Searches per 1000 Population by Gender",
    subtitle = "from Year 2021 to 2023 using 2021 Census Data"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.subtitle = element_text(hjust = 0.5),# Center-align the subtitle
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0),   
    plot.caption = element_text(hjust = 0),
    plot.caption.position = "plot",
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")
  )

# Display the line graph
print(line_chart)

unique_values_in_ethnicity <- unique(stop_and_searches_2021_2023$combined_ethnicity)
# unique_values_in_ethnicity

ethnicity_df_year <- dbGetQuery(db, "
 SELECT 
    year,
    combined_ethnicity as ethnicity,
    
    COUNT(*) AS number_of_searches,
    
    (COUNT(*) * 1000.0) / 
     ((SELECT SUM(REPLACE(p.Population, ',', '') + 0)
     FROM population_by_force_by_ethnicity_21census p 
       WHERE p.combinedethnicity = ss.combined_ethnicity) )AS searches_per_1000_population_per_year,
       
    (COUNT(*) * 100.0) / (SELECT COUNT(*) 
    FROM stop_and_searches_2021_2023 
    WHERE year = ss.year AND combined_ethnicity IS NOT NULL) AS search_percentage,
    
    SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0 / NULLIF(COUNT(*), 0) AS percentage_reasonable
    
FROM stop_and_searches_2021_2023 ss
WHERE combined_ethnicity IS NOT NULL
GROUP BY year,combined_ethnicity
")
# ethnicity_df_year
ethnicity_df <- dbGetQuery(db, "
 SELECT 
    combined_ethnicity as ethnicity,
    COUNT(*) AS number_of_searches,
    COUNT(*) / 3.0 AS average_number_of_searches_per_year,
    
    (COUNT(*) * 1000.0) / 
    (3*(SELECT SUM(REPLACE(p.Population, ',', '') + 0)FROM population_by_force_by_ethnicity_21census p 
       WHERE p.combinedethnicity = ss.combined_ethnicity) )AS searches_per_1000_population_per_year,
      
    (COUNT(*) * 100.0) / (SELECT COUNT(*) 
    FROM stop_and_searches_2021_2023 
    WHERE combined_ethnicity IS NOT NULL) AS search_percentage,
    
    SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0 / 
    NULLIF(COUNT(*), 0) AS percentage_reasonable
      
    
FROM stop_and_searches_2021_2023 ss
WHERE combined_ethnicity IS NOT NULL
GROUP BY combined_ethnicity
")
# ethnicity_df


library(ggplot2)
library(scales)
# Assuming ethnicity_df is your dataframe already loaded in R

# Pie Chart for search_percentage with labels
pie_chart <- ggplot(ethnicity_df, aes(x = "", y = search_percentage, fill = ethnicity)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(search_percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), 
            color = "black",size=2.4) +
  theme_void() +
 scale_fill_brewer(palette = "Set1", name = "Ethnicity", labels = paste(ethnicity_df$ethnicity, ": ", round(ethnicity_df$search_percentage, 2),
"%"))+
  labs(caption = "Note: \n• Does not include vehicle only searches. \n• Does not include search records where ethnicity is unknown.",
       fill = "Ethnicity", 
       title = "Percentage of Searches by Ethnicity from Year 2021 to 2023")+
  theme(
    plot.caption = element_text(hjust = 0),  # Left-align the text within the caption
    plot.caption.position = "plot"  # Center the caption box itself at the bottom of the plot
  ) +
  theme(plot.margin = margin(1, 1, 1, 1, "cm"))# Center-align the subtitle

pie_chart
# Define colors for each ethnicity
base_colors <- c("Asian or Asian British" = "#d62728", 
                 "Black or Black British" = "#1f77b4", 
                 "Other Ethnic Group" = "purple", 
                 "White" = "#ff7f0e",
                 "Mixed" = "#2ca02c")

# Create a named vector for alpha values to be used for the legend
alpha_values <- c("Linked to search object" = 1, "No Link" = 0.45)

# Create the bar chart
bar_chart <- ggplot(ethnicity_df, aes(x = ethnicity, y = number_of_searches, fill = ethnicity)) +
  geom_bar(stat = "identity", aes(alpha = "No Link")) +
  geom_bar(stat = "identity", 
           aes(y = number_of_searches * percentage_reasonable / 100, alpha = "Linked to search object")) +
  # Labels for percentage of reasonable searches
  geom_text(aes(y = number_of_searches * percentage_reasonable / 100, 
                label = paste0(round(percentage_reasonable, 2), "%")), 
            vjust = 0.8, 
            color = "black",size=2) +
  # Labels for total number of searches)
  geom_text(aes(label = number_of_searches), 
            vjust = -0.05, 
            color = "black",size=3) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = base_colors) +
  scale_alpha_manual(values = alpha_values, 
                     guide = guide_legend(title = "Search Outcome")) +
  theme_minimal() +
  labs(caption = "Note: \n• Does not include vehicle only searches. \n• Does not include search records where ethnicity is unknown.",
       title = "Total Number of Stop and Searches by Ethnicity from 2021 to 2023",
        subtitle = "with outcome-linked searches percentage",
       x = "Ethnicity", y = "Number of Searches", fill = "Ethnicity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        plot.subtitle = element_text(hjust = 0.5),   # Center-align the subtitle
        plot.caption = element_text(hjust = 0),      # Left-align the caption text
        plot.caption.position = "plot",
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"))

bar_chart

library(ggplot2)
library(scales)  # For formatting the y-axis labels



# Convert 'year' to a factor to ensure proper dodging
ethnicity_df_year$year <- factor(ethnicity_df_year$year)

# Plotting the bar chart
ggplot(ethnicity_df_year, aes(x = ethnicity, y = searches_per_1000_population_per_year, fill = year)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(
    aes(label = paste0(round(searches_per_1000_population_per_year, 2)), group = year),
    #aes(label = as.integer(searches_per_1000_population_per_year)),  # Format as integer
    vjust = -0.3, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 3  # Adjust the text size here
  ) +
  scale_y_continuous(labels = scales::comma) +  # Use comma formatting for y-axis
  labs(caption = "Note: \n• Does not include vehicle only searches. \n• Does not include search records where ethnicity is unknown.", x = "Ethnic Group", y = "Number of Searches", title = "Search Rate: Number of Searches per 1000 Population by Ethnicity",
       subtitle = "from Financial Year 2021 to 2023 using 2021 Census Data") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5),      # Center-align the title
    plot.caption = element_text(hjust = 0),      # Left-align the caption text
    plot.caption.position = "plot",
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")
  )

library(ggplot2)
library(scales)  # For formatting the y-axis labels

# Convert 'year' to a factor to ensure proper dodging
ethnicity_df_year$year <- factor(ethnicity_df_year$year)

# Calculate the non-reasonable searches
ethnicity_df_year$non_reasonable_searches <- ethnicity_df_year$number_of_searches - (ethnicity_df_year$number_of_searches * ethnicity_df_year$percentage_reasonable / 100)

# Define alpha values for reasonable and non-reasonable searches
alpha_values <- c("Linked to search object" = 1, "No link" = 0.5)  # Adjust non-reasonable alpha here

# Plotting the bar chart
ggplot(ethnicity_df_year, aes(x = ethnicity, fill = year)) +
  # Reasonable searches
  geom_bar(aes(y = number_of_searches * percentage_reasonable / 100, alpha = "Linked to search object"), stat = "identity", position = "dodge") +
  # Non-reasonable searches (adjusted alpha value for better visibility)
  geom_bar(aes(y = non_reasonable_searches, alpha = "No link"), stat = "identity", position = "dodge") +
  # Labels for total number of searches
  geom_text(
    aes(y = number_of_searches, label = number_of_searches), 
    vjust = -0.4, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 2
  ) +
  # Labels for percentage of reasonable searches
  geom_text(
    aes(y = number_of_searches * percentage_reasonable / 100, label = paste0(round(percentage_reasonable, 2), "%")), 
    vjust = 0.7, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 2.5
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_alpha_manual(values = alpha_values, guide = guide_legend(title = "Search Outcome")) +
  labs(caption = "Note: \n• Does not include vehicle only searches. \n• Does not include search records where ethnicity is unknown.", x = "Ethnic Group", y = "Number of Searches", title = "Total Number of Stop and Searches by Ethnicity", subtitle = "from Year 2021 to 2023 with outcome-linked search percentage") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.subtitle = element_text(hjust = 0.5),   # Center-align the subtitle
    plot.caption = element_text(hjust = 0),      # Left-align the caption text
    plot.caption.position = "plot",
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")
  )



library(ggplot2)
library(scales)
library(tidyr)
library(dplyr)

ethnicity_df_year$year <- as.numeric(as.character(ethnicity_df_year$year))
# Reshape the data to long format
long_ethnicity_df_year <- ethnicity_df_year %>%
  pivot_longer(
    cols = c(searches_per_1000_population_per_year, number_of_searches),
    names_to = "metric",
    values_to = "value"
  )

# Plotting the line graph
line_chart <- ggplot(long_ethnicity_df_year, aes(x = year, y = value, color = ethnicity, group = ethnicity)) +
  geom_line() +
  geom_point() + # Add points to the line
  geom_text(
    aes(label = paste0(round(value, 2))),
    vjust = -0.5,  # Adjust this value to position the text above the points
    color = "black", 
    size = 2.1,
    check_overlap = TRUE  # Avoid overlapping text
  ) +
  scale_x_continuous(breaks = unique(long_ethnicity_df_year$year)) + # Set breaks at unique years
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    caption = "Note: \n• Does not include vehicle only searches. \n• Does not include searches record whose ethnicity is unknown.",
    x = "Year", y = "Value", 
    title = "Search Rate and Total Number of Searches per 1000 Population by Ethnicity",
    subtitle = "from Year 2021 to 2023 using 2021 Census Data"
  ) +
  theme_minimal() +
  theme(plot.subtitle = element_text(hjust = 0.5),# Center-align the subtitle
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5),   
    plot.caption = element_text(hjust = 0),
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")
  )

# Display the line graph
print(line_chart)


unique_values_in_age <- unique(stop_and_searches_2021_2023$age_range)
#unique_values_in_age

age_sum <- dbGetQuery(db, "
 SELECT 
    age_range,
    COUNT(*) AS number_of_searches,
    
    COUNT(*) / 3.0 AS average_number_of_searches,
    
    (COUNT(*) * 1000.0) / 
     (3* (SELECT SUM(REPLACE(page.Population, ',', '') + 0)FROM age_ethnicity page
       WHERE page.age_group = ss.age_range) )AS searches_per_1000_population,
       
    (COUNT(*) * 100.0) / (SELECT COUNT(*) 
    FROM stop_and_searches_2021_2023 
    WHERE age_range IS NOT NULL ) AS search_percentage,
    
    SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0 / 
    NULLIF(COUNT(*), 0) AS percentage_reasonable
    
FROM stop_and_searches_2021_2023 ss
WHERE age_range IS NOT NULL
GROUP BY age_range
")

age_df_year <- dbGetQuery(db, "
 SELECT 
   year,
    age_range,
    COUNT(*) AS number_of_searches,
    
    (COUNT(*) * 1000.0) / 
     ((SELECT SUM(REPLACE(page.Population, ',', '') + 0)
    FROM age_ethnicity page
       WHERE page.age_group = ss.age_range) ) AS searches_per_1000_population_per_year,
       
    (COUNT(*) * 100.0) / (SELECT COUNT(*) 
    FROM stop_and_searches_2021_2023 
    WHERE year = ss.year AND age_range IS NOT NULL ) AS search_percentage,
    
    SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0 / NULLIF(COUNT(*), 0) 
    AS percentage_reasonable
   
FROM stop_and_searches_2021_2023 ss
WHERE age_range IS NOT NULL
GROUP BY year,age_range
")
age_sum

age_df_year
library(ggplot2)
library(scales)
# Assuming age_sum is your dataframe already loaded in R

# Pie Chart for search_percentage with labels
pie_chart <- ggplot(age_sum, aes(x = "", y = search_percentage, fill = age_range)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(search_percentage, 2), "%")), 
            position = position_stack(vjust = 0.5), 
            color = "black") +
  theme_void() +
  scale_fill_brewer(palette = "Set1", name = "Age Group",
                    labels = paste0(age_sum$age_range, ": ", round(age_sum$search_percentage, 2), "%")) +
 theme(
    plot.caption = element_text(hjust = 0),
    plot.caption.position = "plot",
    plot.margin = margin(1, 1, 1, 1, "cm")
  )+
  labs(caption = "Note: \n• Does not include vehicle-only searches.\n• Does not include search records where age is unknown.", x = "Age Group", y = "Number of Searches", title = "Percentage of Stop and Searches by Age Group from 2021 to 2023")

pie_chart
# Define colors for each ethnicity
base_colors <- c("under 10"="#ff7f0e",
                 "10-17" = "#d62728", 
                 "18-24" ="#1f77b4", 
                 "25-34" = "#2ca02c", 
                 "over 34" = "purple"
                 )

# Create a named vector for alpha values to be used for the legend
alpha_values <- c("Link to search object" = 1, "No link" = 0.45)



# Create the bar chart
bar_chart <- ggplot(age_sum, aes(x = age_range, y = number_of_searches, fill = age_range)) +
  geom_bar(stat = "identity", aes(alpha = "No link")) +
  geom_bar(stat = "identity", 
           aes(y = number_of_searches * percentage_reasonable / 100, alpha = "Link to search object")) +
  geom_text(aes(y = number_of_searches * percentage_reasonable / 100, 
                label = paste0(round(percentage_reasonable, 2), "%")), 
            vjust = 1, 
            color = "black") +
  geom_text(aes(label = number_of_searches), 
            vjust = -0.45, size=2.8,
            color = "black") +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = base_colors) +
  scale_alpha_manual(values = alpha_values, 
                     guide = guide_legend(title = "Outcome")) +
  theme_minimal() +
  labs(caption = "Note: \n• Does not include vehicle-only searches.\n• Does not include search records where age is unknown.", title = "Total Number of Stop and Searches by Age Group from 2021 to 2023",
        subtitle = "with outcome-linked searches percentage",
       x = "Age", y = "Number of Searches", fill = "Age") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0)) # Center-align the subtitle

bar_chart

library(ggplot2)
library(scales)  # For formatting the y-axis labels


# Convert 'year' to a factor to ensure proper dodging
age_df_year$year <- factor(age_df_year$year)

# Plotting the bar chart
ggplot(age_df_year, aes(x = age_range, y = searches_per_1000_population_per_year, fill = year)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(
    aes(label = as.integer(searches_per_1000_population_per_year)),  # Format as integer
    vjust = -0.3, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 3  # Adjust the text size here
  ) +
  scale_y_continuous(labels = scales::comma) +  # Use comma formatting for y-axis
  labs(x = "Age Group", y = "Number of Searches per 1000 Population", title = "Number of Searches per 1000 Population by Age Group",
       subtitle = "from Year 2021 to 2023 using 2021 Census Data") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5),   # Center-align the title
    plot.subtitle = element_text(hjust = 0.5) # Center-align the subtitle
  )+labs(
    caption = "Note: \n• Does not include vehicle-only searches.\n• Does not include search records where age is unknown."
  )

library(ggplot2)
library(scales)  # For formatting the y-axis labels


# Convert 'year' to a factor to ensure proper dodging
age_df_year$year <- factor(age_df_year$year)
# Calculate the non-reasonable searches
age_df_year$non_reasonable_searches <- age_df_year$number_of_searches - (age_df_year$number_of_searches * age_df_year$percentage_reasonable / 100)

# Define alpha values for reasonable and non-reasonable searches
alpha_values <- c("Reasonable" = 1, "Non-Reasonable" = 0.5)  # Adjust non-reasonable alpha here

# Plotting the bar chart
ggplot(age_df_year, aes(x = age_range, fill =year)) +
  # Reasonable searches
  geom_bar(aes(y = number_of_searches * percentage_reasonable / 100, alpha = "Reasonable"), stat = "identity", position = "dodge") +
  # Non-reasonable searches (adjusted alpha value for better visibility)
  geom_bar(aes(y = non_reasonable_searches, alpha = "Non-Reasonable"), stat = "identity", position = "dodge") +
  # Labels for total number of searches
  geom_text(
    aes(y = number_of_searches, label = number_of_searches), 
    vjust = -0.9, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 2.1
  ) +
  # Labels for percentage of reasonable searches
  geom_text(
    aes(y = number_of_searches * percentage_reasonable / 100, label = paste0(round(percentage_reasonable, 2), "%")), 
    vjust = 0.5, 
    position = position_dodge(width = 0.9), 
    color = "black", 
    size = 2.5
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_alpha_manual(values = alpha_values, guide = guide_legend(title = "Search Outcome")) +
  labs(caption = "Note: \n• Does not include vehicle-only searches.\n• Does not include search records where age is unknown.", x = "Age Group", y = "Number of Searches", title = "Number of Stop and Searches by Age Group from Year 2021 to 2023") +
  theme_minimal() +
  theme(
    plot.caption = element_text(hjust = 0),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )


library(ggplot2)
library(scales)
library(tidyr)
library(dplyr)

age_df_year$year <- as.numeric(as.character(age_df_year$year))
# Reshape the data to long format
long_age_df_year <- age_df_year %>%
  pivot_longer(
    cols = c(searches_per_1000_population_per_year, number_of_searches),
    names_to = "metric",
    values_to = "value"
  )

# Plotting the line graph
line_chart <- ggplot(long_age_df_year, aes(x = year, y = value, color = age_range, group = age_range)) +
  geom_line() +
  geom_point() + # Add points to the line
  geom_text(
    aes(label = paste0(round(value, 2))),
    vjust = -0.5,  # Adjust this value to position the text above the points
    color = "black", 
    size = 2.1,
    check_overlap = TRUE  # Avoid overlapping text
  ) +
  scale_x_continuous(breaks = unique(long_age_df_year$year)) + # Set breaks at unique years
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    caption = "Note: \n• Does not include vehicle-only searches.\n• Does not include search records where age is unknown.",
    x = "Year", y = "Value", 
    title = "Search Rate and Total Number of Searches per 1000 Population by Age Group",
    subtitle = "from Year 2021 to 2023 using 2021 Census Data"
  ) +
  theme_minimal() +
  theme(plot.subtitle = element_text(hjust = 0.5),# Center-align the subtitle
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5),   
    plot.caption = element_text(hjust = 0),
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")
  )

# Display the line graph
print(line_chart)

# to get an overall view from census data against ethnicity
ethnicity_age <- dbGetQuery(db, "
 SELECT age_group,
 ethnicity_group, 
 SUM(observation) AS Population
 FROM ethnicity_age_sex_census psex 
 GROUP BY age_group, ethnicity_group
")
ethnicity_age
ethnicity_age_sum <- dbGetQuery(db, "
 SELECT 
    age_range,
    combined_ethnicity,
    COUNT(*) AS number_of_searches,
    
    COUNT(*)/3 AS average_number_of_searches,
    
    (COUNT(*) * 1000.0) / 
     (3 * (SELECT SUM(REPLACE(pae.Population, ',', '') + 0) FROM age_ethnicity pae
       WHERE pae.age_group = ss.age_range AND pae.ethnicity = ss.combined_ethnicity)) AS searches_per_1000_population,
       
     (COUNT(*) * 100.0) / (SELECT COUNT(*) FROM stop_and_searches_2021_2023 WHERE age_range IS NOT NULL  AND combined_ethnicity IS NOT NULL) AS search_percentage,
    
    (SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0) / NULLIF(COUNT(*), 0) AS percentage_reasonable
    
FROM stop_and_searches_2021_2023 ss
WHERE age_range IS NOT NULL  AND combined_ethnicity IS NOT NULL
GROUP BY age_range,combined_ethnicity
")

ethnicity_age_sum$searches_per_1000_population <- format(ethnicity_age_sum$searches_per_1000_population, scientific = FALSE)
ethnicity_age_sum$search_percentage <- format(ethnicity_age_sum$search_percentage, scientific = FALSE)
ethnicity_age_sum
ethnicity_age_df_year <- dbGetQuery(db, "
 SELECT 
    year,
    age_range,
    combined_ethnicity,
    COUNT(*) AS number_of_searches,
    
    (COUNT(*) * 1000.0) / 
     ((SELECT SUM(REPLACE(pae.Population, ',', '') + 0) FROM age_ethnicity pae 
       WHERE pae.age_group = ss.age_range AND pae.ethnicity = ss.combined_ethnicity)) AS searches_per_1000_population_per_year,
       
    (COUNT(*) * 100.0) / 
    (SELECT COUNT(*) FROM stop_and_searches_2021_2023 WHERE year = ss.year AND age_range IS NOT NULL AND combined_ethnicity IS NOT NULL ) AS search_percentage,
    
    (SUM(CASE WHEN outcome_linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0) / NULLIF(COUNT(*), 0) AS percentage_reasonable
    
FROM stop_and_searches_2021_2023 ss
WHERE age_range IS NOT NULL  AND combined_ethnicity IS NOT NULL 
GROUP BY year,age_range,combined_ethnicity
")

ethnicity_age_df_year$search_percentage <- format(ethnicity_age_df_year$search_percentage, scientific = FALSE)
ethnicity_age_df_year
library(ggplot2)

ethnicity_age_sum$searches_per_1000_population <- as.numeric(ethnicity_age_sum$searches_per_1000_population)

# If there are no non-numeric characters, you can convert directly to numeric
# age_sex_sum$searches_per_1000_population <- as.numeric(age_sex_sum$searches_per_1000_population)

# Now round and convert to integer
ethnicity_age_sum$searches_per_1000_population <- as.integer(round(ethnicity_age_sum$searches_per_1000_population))

my_colors <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# Plotting the bar chart
ggplot(ethnicity_age_sum, aes(x = age_range, y = average_number_of_searches, fill = combined_ethnicity)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 1) +
  scale_fill_manual(values = my_colors) + 
  scale_y_continuous(labels = label_number(big.mark = ",")) +  # This will format numbers with commas
  labs(
    caption = "Note:\n• Does not include vehicle-only searches.\n• Does not include search records where ethnicity is unknown.\n• Does not include search records where age is unknown.",
    title = "Average Number of Searches by Age Group and Ethnicity ",
    subtitle = "from year 2021 to 2023 based on 2021 Census",
    x = "Age Group",
    y = "Average Number of Searches"
  ) +
  theme_minimal()+
   theme(
    plot.caption = element_text(hjust = 0), # Left-align the text within the caption
    plot.caption.position = "plot", # Center the caption box itself at the bottom of the plot
    plot.subtitle = element_text(hjust = 0.5))# Center-align the subtitle


# Plotting the bar chart
ggplot(ethnicity_age_sum, aes(x = age_range, y = searches_per_1000_population, fill = combined_ethnicity)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 1) +
  scale_fill_manual(values = my_colors) + 
  scale_y_continuous(labels = label_number(big.mark = ",")) +  # This will format numbers with commas
  labs(
    caption = "Note: \n• Does not include vehicle-only searches.\n• Does not include search records where ethnicity is unknown.\n• Does not include search records where age is unknown.",
    title = "Average Searches per 1000 Population by Age Group and Ethnicity",
    subtitle = "from year 2021 to 2023 based on 2021 Census",
    x = "Age Group",
    y = "Searches per 1000 Population"
  ) +
  theme_minimal() +
  theme(
    plot.caption = element_text(hjust = 0), # Left-align the text within the caption
    plot.caption.position = "plot", # Center the caption box itself at the bottom of the plot
    plot.subtitle = element_text(hjust = 0.5))# Center-align the subtitle
# Here I summarise each source table separately first, to cross-check the figures computed in the combined SQL queries
age_sex <- dbGetQuery(db, "
 SELECT age_group,
 sex, 
 SUM(observation) AS Population
 FROM ethnicity_age_sex_census psex 
 GROUP BY age_group, sex
")

 
age_sex_census<- dbGetQuery(db, "
 SELECT age_range,
 gender, 
 COUNT(*) AS number_of_searches
 FROM stop_and_searches_2021_2023 
 GROUP BY age_range, gender
")
age_sex
age_sex_census
age_sex_sum <- dbGetQuery(db, "
 SELECT 
    age_range,
    gender,
    COUNT(*) AS number_of_searches,
    
    COUNT(*)/3 AS average_number_of_searches,
    
    (COUNT(*) * 1000.0) / 
     (3 * (SELECT SUM(REPLACE(pas.observation, ',', '') + 0)
           FROM ethnicity_age_sex_census pas 
           WHERE pas.age_group = ss.age_range AND pas.sex = ss.gender)
     ) AS searches_per_1000_population,
    
    (SELECT SUM(REPLACE(pas.observation, ',', '') + 0) 
     FROM ethnicity_age_sex_census pas 
     WHERE pas.age_group = ss.age_range AND pas.sex = ss.gender) * 100.0 / 
    (SELECT SUM(REPLACE(pas.observation, ',', '') + 0) 
     FROM ethnicity_age_sex_census pas) AS population_percentage,
    
    (COUNT(*) * 100.0) / 
    (SELECT COUNT(*) FROM stop_and_searches_2021_2023 WHERE age_range IS NOT NULL AND gender IS NOT NULL AND gender NOT IN ('Other')) AS search_percentage,
    
    (SUM(CASE WHEN outcome_Linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 1.0) / NULLIF(COUNT(*), 0) * 100 AS percentage_reasonable
    
 FROM stop_and_searches_2021_2023 ss
 WHERE age_range IS NOT NULL AND gender IS NOT NULL AND gender NOT IN ('Other')
 GROUP BY age_range, gender
")

age_sex_sum$searches_per_1000_population <- format(age_sex_sum$searches_per_1000_population, scientific = FALSE)
age_sex_sum$search_percentage <- format(age_sex_sum$search_percentage, scientific = FALSE)
age_sex_sum

library(ggplot2)
library(dplyr)
library(tidyr)
age_sex_sum$population_percentage <- as.numeric(as.character(age_sex_sum$population_percentage))
age_sex_sum$search_percentage <- as.numeric(as.character(age_sex_sum$search_percentage))

# Convert the data to long format
df_long <- age_sex_sum %>%
  pivot_longer(cols = c(population_percentage, search_percentage),
               names_to = "metric", values_to = "percentage") %>%
  mutate(percentage = if_else(metric == "population_percentage", -percentage, percentage),
         metric = factor(metric, levels = c("population_percentage", "search_percentage"))) %>%
  unite("metric_gender", metric, gender, sep = "_") # Combine metric and sex into one column

# Define the width for dodging
dodge_width <- 0.9

# Plot
ggplot(df_long, aes(x = age_range, y = percentage, fill = metric_gender)) +
  geom_bar(stat = "identity", position = position_dodge(width = dodge_width)) +
  coord_flip() + 
  scale_y_continuous(labels = abs, 
                     breaks = seq(-100, 100, by = 10), 
                     limits = c(-max(abs(df_long$percentage)), max(abs(df_long$percentage)))) +
  scale_fill_manual(values = c("population_percentage_Female" = "red", 
                               "population_percentage_Male" = "blue",
                               "search_percentage_Female" = "lightpink",
                               "search_percentage_Male" = "lightblue")) +
  labs(x = "Age Group", y = "Percentage", fill = "Group",
        title = "Average Search Percentage versus Population Percentage by Age Group and Gender",
    subtitle = "from year 2021 to 2023 based on 2021 Census",
    caption = "Note: \n• Does not include vehicle-only searches.\n• Does not include search records where age is unknown.\n• Does not include search records where gender is unknown.") +
  theme_minimal() +
  
  theme(plot.subtitle = element_text(hjust = 0.5))+# Center-align the subtitle
  theme(legend.position = "right",  plot.title = element_text(size = 10))+
  theme(
    plot.caption = element_text(hjust = 0), # Left-align the text within the caption
    plot.caption.position = "plot" # Center the caption box itself at the bottom of the plot
  )




unique_values_in_force <- unique(stop_and_searches_2021_2023$force)


force_df_sum <- dbGetQuery(db, "
SELECT force FROM stop_and_searches_2021_2023 GROUP BY force")
# to assess data consistency
unique_values_in_force
force_df_sum
force_df_sum <- dbGetQuery(db, "
SELECT 
    force,
    COUNT(*) AS total_number_of_searches_21_23,
    COUNT(*) / 3 AS average_number_of_searches,
    (COUNT(*) * 1000.0) / (3 * (SELECT SUM(REPLACE(p.Population, ',', '') + 0) 
                               FROM population_by_force_by_ethnicity_21census p 
                               WHERE p.`Police.Force.Area` = ss.force)) 
    AS searches_per_1000_population,
    (COUNT(*) * 100.0) / (SELECT COUNT(*) 
                          FROM stop_and_searches_2021_2023 
                          WHERE year = ss.year  ) 
    AS search_percentage,
    (SUM(CASE WHEN outcome_Linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0) / 
    NULLIF(COUNT(*), 0) AS percentage_reasonable
FROM stop_and_searches_2021_2023 ss
GROUP BY ss.force

")
force_df_sum
force_df_year <- dbGetQuery(db, "
 SELECT
    force,
    COUNT(*) AS total_number_of_searches_21_23,
    
    (COUNT(*) * 1000.0) /  (SELECT SUM(REPLACE(p.Population, ',', '') + 0) 
                               FROM population_by_force_by_ethnicity_21census p 
                               WHERE p.`Police.Force.Area` = ss.force)
    AS searches_per_1000_population,
    
    
    (COUNT(*) * 100.0) / (SELECT COUNT(*) 
                          FROM stop_and_searches_2021_2023 
                          WHERE year = ss.year AND force IS NOT NULL ) 
    AS search_percentage,
    
    (SUM(CASE WHEN outcome_Linked_to_object_of_search = TRUE THEN 1 ELSE 0 END) * 100.0) / 
    NULLIF(COUNT(*), 0) AS percentage_reasonable
FROM stop_and_searches_2021_2023 ss
WHERE ss.force IS NOT NULL
GROUP BY year,ss.force")
force_df_year
library(sf)
library(dplyr)

# Define the path to your folder containing KML files
folder_path <- "force kmls"

# List all KML files in the folder
kml_files <- list.files(folder_path, pattern = "\\.kml$", full.names = TRUE)

# Read all KML files and combine them into one sf object
all_kml_data <- lapply(kml_files, st_read) %>% 
  bind_rows()

# Now we can plot all the boundaries:
# ggplot() + geom_sf(data = all_kml_data)


# download and import the force boundary data from police.data.uk
library(sf)
library(dplyr)

# Define the path to your folder containing KML files
folder_path <- "force kmls"

# List all KML files in the folder
kml_files <- list.files(folder_path, pattern = "\\.kml$", full.names = TRUE)

# Read all KML files, assign the file name (without .kml extension) to the 'Name' column, and combine into one sf object
all_kml_data <- lapply(kml_files, function(file) {
  kml_data <- st_read(file)
  # Remove the .kml extension from the file name
  kml_data$Name <- gsub(".kml", "", basename(file), fixed = TRUE)
  return(kml_data)
}) %>% bind_rows()

# Now we can plot all the boundaries with names filled in
ggplot() +  geom_sf(data = all_kml_data, aes(fill = Name))  # Fill based on the Name column

#store at local 
saveRDS(all_kml_data, "all_kml_data.rds")

all_kml_data <- readRDS("all_kml_data.rds")
# Unique names from all_kml_data_update
unique_names <- unique(all_kml_data$Name)

# Unique police_force_area from force_df_sum
unique_police_force_areas <- unique(force_df_sum$force)

# Check unmatched names in both directions
unmatched_names_in_force_df_sum <- unique_police_force_areas[!unique_police_force_areas %in% unique_names]
unmatched_names_in_all_kml_data <- unique_names[!unique_names %in% unique_police_force_areas]

# Print unmatched names
print("Unmatched Names in force_df_sum:")
print(unmatched_names_in_force_df_sum)

print("Unmatched Names in all_kml_data:")
print(unmatched_names_in_all_kml_data)
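Any names printed as unmatched above need to be harmonised before joining the search counts onto the boundaries. A minimal sketch of one way to do this; the mapping below is a hypothetical example, not an actual mismatch from this data:

```r
library(dplyr)

# Hypothetical example mapping from force_df_sum spellings to KML file names;
# replace with the real unmatched pairs printed above.
name_fixes <- c("metropolitan" = "Metropolitan Police")

# Recode only the forces that appear in the mapping; all others pass through.
force_df_sum <- force_df_sum %>%
  mutate(force = if_else(force %in% names(name_fixes),
                         unname(name_fixes[force]), force))
```

With no real mismatches in the mapping this is a no-op, so it is safe to run before the join.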
## This chunk gives a sample indication of how to use downloaded archive data (partial data from Gwent and North Yorkshire) as additional data for the research. Running this chunk directly would report an error, because the combined data has more rows than dplyr can comfortably handle, so some manipulation is needed first. One option, shown here, is to use filter() to remove rows containing NA. Another is to aggregate the data and add a column called search, holding the number of searches in each defined search category; the appendix shows how to edit the code for the statistical data from financial year 2020 to 2022 (Mar 2021 to Mar 2023).

library(readr)
library(dplyr)

# Function to read all CSVs in a folder, add a modified 'force' column based on the file name
read_csvs_and_add_modified_force <- function(folder) {
  csv_files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

  if (length(csv_files) == 0) {
    message("No CSV files found in ", folder)
    return(NULL)  # Return NULL if no files are found
  }

  all_data <- lapply(csv_files, function(file) {
    data <- read_csv(file)
    # Extract the relevant part of the filename for 'force'
    force_name <- sub("^\\d{4}-\\d{2}-(.*)-stop-and-search.*$", "\\1", basename(file))
    data <- mutate(data, force = force_name)
    return(data)
  })

  return(bind_rows(all_data))
}

# Folder path
folder_path <- "additional data"  # Replace with your actual folder path

# Read and combine all CSVs from the folder
combined_data <- read_csvs_and_add_modified_force(folder_path)

colnames(combined_data)
colnames(stop_and_searches_2021_2023)


# Renaming columns in combined_data to match those in stop_and_searches_2021_2023
combined_data <- combined_data %>%
  rename(
    type = Type,
    datetime = Date,
    location.latitude = Latitude,
    location.longitude = Longitude,
    gender = Gender,
    age_range = `Age range`,
    self_defined_ethnicity = `Self-defined ethnicity`,
    officer_defined_ethnicity = `Officer-defined ethnicity`,
    legislation = Legislation,
    object_of_search = `Object of search`,
    outcome_object.name = Outcome,
    outcome_linked_to_object_of_search = `Outcome linked to object of search`,
    force = force
  )

# Example: drop rows where gender or age_range is missing
combined_data_filtered <- combined_data %>%
  filter( !is.na(gender) & !is.na(age_range) )
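The aggregation route mentioned in the note above can be sketched as follows; the grouping columns here are an assumption and should be adjusted to the categories of interest:

```r
library(dplyr)

# Collapse the row-per-search data to one row per category, with a `search`
# column holding the count in that category; the grouping columns are
# illustrative assumptions.
combined_data_grouped <- combined_data_filtered %>%
  group_by(force, gender, age_range, officer_defined_ethnicity) %>%
  summarise(search = n(), .groups = "drop")
```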


# Joining combined_data and stop_and_searches_2021_2023
# Replace 'type' with the appropriate column(s) for joining
joined_data <- inner_join(combined_data, stop_and_searches_2021_2023, by = "type")
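If `type` alone is not a reliable key, a multi-column join is one option; the key columns below are illustrative assumptions, not the report's actual key:

```r
library(dplyr)

# Hypothetical multi-key join; replace the `by` columns with whichever
# combination uniquely identifies a search record in both tables.
joined_data_multi <- inner_join(combined_data, stop_and_searches_2021_2023,
                                by = c("datetime", "force", "type"))
```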



map_to_plot<- right_join(force_df_sum , all_kml_data, by = c("force" = "Name"))
# After the join, map_to_plot will include all rows from force_df_sum and the matching rows from all_kml_data, including any with empty geometry.



# This will remove rows where the geometry is empty (including 'MULTIPOLYGON Z EMPTY')
map_to_plot_filtered<- map_to_plot %>%
  filter(!st_is_empty(geometry))
library(sf)
# Check if map_to_plot is still an sf object
#class(map_to_plot_filtered)

# If map_to_plot_filtered is not an sf object, convert it back

if (!"sf" %in% class(map_to_plot_filtered)) {
  map_to_plot_filtered <- st_as_sf(map_to_plot_filtered)
}


# Check the class again
#class(map_to_plot_filtered)

# Check rows with empty geometry
rows_with_empty_geometry <- map_to_plot_filtered[which(st_is_empty(st_geometry(map_to_plot_filtered))), ]


# Count the number of such rows
num_rows_with_empty_geometry <- nrow(rows_with_empty_geometry)

# Print the count
print(paste("Number of rows with 'MULTIPOLYGON Z EMPTY' geometry:", num_rows_with_empty_geometry))

# Optionally, view the rows with empty geometry
if (num_rows_with_empty_geometry > 0) {
  print(rows_with_empty_geometry$police_force_area)
}

library(sf)
library(tmap)
library(classInt)
library(ggplot2)
library(dplyr)
# install.packages("ggspatial")  # run once if ggspatial is not already installed
library(ggspatial)
# Identify the highest value and second highest value
highest_value <- max(map_to_plot_filtered$searches_per_1000_population, na.rm = TRUE)
second_highest_value <- sort(map_to_plot_filtered$searches_per_1000_population, decreasing = TRUE)[2]
highest_location <- st_centroid(map_to_plot_filtered[which.max(map_to_plot_filtered$searches_per_1000_population), ])

# Plot with ggplot
map_rate <- ggplot(map_to_plot_filtered) +
  geom_sf(aes(fill = searches_per_1000_population)) +
  geom_sf(data = highest_location, color = "red", size = 3, shape = 8) + # Shape 8 is a star
  scale_fill_viridis_c(
    name = "Relative Search Rate per 1000 Population",
    limits = c(0, second_highest_value),  # Set the limits excluding the highest value
    option = "D"
  ) +
  labs(title = "Average Search Rate Distribution by Police Force from 2021 to 2023",
       
    caption = "Note: \n• The red star indicates the force area with the highest searches per 1000 population: City of London, at 295. \n• Grey areas indicate where stop and search data is missing due to technical issues at DATA.POLICE.UK"
  ) +
  theme_minimal() +
  theme(legend.position = "right", plot.caption = element_text(hjust = 0))+
  annotation_scale(location = "bl")
  

# Print the map
print(map_rate)

library("plotly")
#ggplotly(map_rate) #interactive map plot to help discover regional difference in detail


library(ggplot2)
library(sf)
library(viridis)



# Identify the highest value and its index
highest_value <- max(map_to_plot_filtered$average_number_of_searches, na.rm = TRUE)
highest_index <- which.max(map_to_plot_filtered$average_number_of_searches)

# Create a subset for the highest value area
highest_area <- map_to_plot_filtered[highest_index, ]

# Plot with ggplot
map_rate <- ggplot(map_to_plot_filtered) +
  geom_sf(aes(fill = average_number_of_searches), lwd = 0.2) + # Fill based on the average number of searches
  scale_fill_viridis_c(
    name = "Average Number of Searches",
    option = "D",
    limits = c(0, 40000), # Set fixed limits for the color scale
    breaks = c(0, 10000, 20000, 30000, 40000), # Set breaks for the legend
    oob = scales::oob_squish # Keep out-of-bounds data points
  ) +
  geom_sf(data = highest_area, fill = "red", lwd = 0.2) + # Overlay the highest area in red
  labs(title = "Average Search Frequency Distribution by Police Force from 2021 to 2023",
    caption = paste("Note: \n• The red area indicates the force area with the highest average searches from 2021 to 2023:",
                    "City of London at", highest_value, ".\n• Grey areas indicate where stop and search data is missing due to technical issues at DATA.POLICE.UK")
  ) +
  theme_minimal() +
  theme(legend.position = "right", plot.caption = element_text(hjust = 0)) +
  annotation_scale(location = "bl")

# Print the map
print(map_rate)

library("plotly")
# ggplotly(map_rate) #interactive map plot to help discover regional difference in detail

# Alternative method: use tmap to draw the map; with an interactive map we can directly identify each force and its figure, which helps when writing the report

# Load required libraries
library(sf)
library(tmap)
library(dplyr)
map <- tm_shape(map_to_plot_filtered) +
  tm_polygons("average_number_of_searches")

# Let's improve the map by creating more visual variation with our bins: 
breaks <- classInt::classIntervals(map_to_plot_filtered$average_number_of_searches, n = 5, style = "jenks", dig.lab=10)
breaks

map_to_plot_filtered <- map_to_plot_filtered |>
  mutate(number_of_searches_cut = cut(average_number_of_searches, breaks$brks, include.lowest = TRUE, dig.lab=10))

map <- tm_shape(map_to_plot_filtered) +
  tm_polygons("average_number_of_searches", palette = "Reds", title = "Average Number of Searches") +
  tm_layout(main.title = "UK Police Stop and Search, 2021-2023", legend.outside = TRUE, frame = FALSE) +
  tm_compass(position = c("right", "bottom")) +
  tm_scale_bar(position = c("right", "bottom"))

# Finally, we can create a dynamic "live" map using tmaps:
tmap_options(check.and.fix = TRUE) 
tmap_mode("view")
map

tmap_save(map, "UK_stop_and_search.html")
tmap_mode("plot")




map <- tm_shape(map_to_plot_filtered) +
  tm_polygons("searches_per_1000_population")

# Let's improve the map by creating more visual variation with our bins: 
breaks <- classInt::classIntervals(map_to_plot_filtered$searches_per_1000_population, n = 5, style = "jenks", dig.lab=10)
breaks

map_to_plot_filtered <- map_to_plot_filtered |>
  mutate(number_of_searches_cut = cut(searches_per_1000_population, breaks$brks, include.lowest = TRUE, dig.lab=10))

map <- tm_shape(map_to_plot_filtered) +
  tm_polygons("searches_per_1000_population", palette = "Reds", title = "Number of Searches Per 1000 Population") +
  tm_layout(main.title = "UK Police Stop and Search, 2021-2023", legend.outside = TRUE, frame = FALSE) +
  tm_compass(position = c("right", "bottom")) +
  tm_scale_bar(position = c("right", "bottom"))

# Finally, we can create a dynamic "live" map using tmaps:
tmap_options(check.and.fix = TRUE) 
tmap_mode("view")
map

tmap_save(map, "UK_stop_and_search.html")
tmap_mode("plot")
library(dplyr)
library(lubridate)

# Parse the datetime, extract the hour, and create the daytime range
stop_and_searches_2021_2023 <- stop_and_searches_2021_2023 %>%filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime), # Parse the datetime string
         hour = hour(datetime), # Extract the hour
         daytime = sprintf("%02d:00-%02d:59", hour, hour)) # Format hour range

# Count the number of records for each hour range
hourly_counts <- stop_and_searches_2021_2023 %>%
  group_by(daytime) %>%
  summarise(count = n())

# Calculate the total number of occurrences
total_count <- sum(hourly_counts$count)

# Calculate the percentage for each hour range
hourly_counts <- hourly_counts %>%
  mutate(percentage = (count / total_count) * 100)

# View the result
#print(hourly_counts)
library(ggplot2)

# Plotting the bar chart
ggplot(hourly_counts, aes(x = daytime, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = count), vjust = -0.3, color = "black", size = 2.5) + 
  labs(x = "Day Time", 
       y = "Number of searches", 
       title = "Frequency of Searches by Hour of Day") +
  
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels for better readability
library(dplyr)
library(lubridate)

# Add a new column for the day of the week
stop_and_searches_2021_2023 <- stop_and_searches_2021_2023 %>%filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime), # Parse the datetime string
         day_of_week = wday(datetime, label = TRUE)) # Extract day of week with label

# Count the number of searches for each day of the week
day_of_week_counts <- stop_and_searches_2021_2023 %>%
  group_by(day_of_week) %>%
  summarise(count = n())

# Calculate the total number of searches
total_searches <- sum(day_of_week_counts$count)

# Calculate the percentage for each day of the week
day_of_week_counts <- day_of_week_counts %>%
  mutate(percentage = (count / total_searches) * 100)

# View the result
#print(day_of_week_counts)

library(ggplot2)

# Plotting the bar chart
ggplot(day_of_week_counts, aes(x = day_of_week, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = count), vjust = -0.3, color = "black", size = 2.5) + 
  labs(x = "Day", 
       y = "Number of searches", 
       title = "Frequency of Searches by Day of Week") +
  
  theme_minimal()

# This chunk generates the complete code appendix.
# eval=FALSE tells R not to run ("evaluate") the code here; it was already run above.

# The code below uses the Home Office open statistical data (published by financial
# year) to plot the map; this source also covers the forces whose records are
# missing from DATA.POLICE.UK.
#install.packages(c("DBI", "RSQLite"))
library(DBI)
library(RSQLite)

# `db` is the SQLite connection created earlier in the report; when running this
# chunk on its own, open one first, e.g. db <- dbConnect(RSQLite::SQLite(), "stop_search.sqlite")
stop_search_mar21_mar23 <- read.csv("stop-search-open-data-tables-mar21-mar23.csv")

dbWriteTable(db, "stop_search_mar21_mar23", stop_search_mar21_mar23, row.names = FALSE, overwrite = TRUE)
tables <- dbListTables(db)
#print(tables)
#dbListFields(db,"stop_search_mar21_mar23")

force_df_sum2 <- dbGetQuery(db, "
  SELECT
    police_force_area,
    SUM(ss.number_of_searches) AS number_of_searches,
    SUM(ss.number_of_searches) / 3 AS average_number_of_searches,
    (SUM(ss.number_of_searches) * 1000.0) /
      (3 * (SELECT SUM(REPLACE(p.Population, ',', '') + 0)
            FROM population_by_force_by_ethic_21census p
            WHERE p.`Police.Force.Area` = ss.police_force_area)) AS searches_per_1000_population,
    (SUM(ss.number_of_searches) * 100.0) /
      (SELECT SUM(number_of_searches) FROM stop_search_mar21_mar23
       WHERE police_force_area NOT IN ('Unknown', 'N/A - vehicle search')) AS search_percentage,
    SUM(CASE WHEN link IN ('Linked', 'Not linked', 'Unknown link') THEN 1.0 ELSE 0 END) /
      NULLIF(SUM(ss.number_of_searches), 0) * 100 AS percentage_reasonable
  FROM stop_search_mar21_mar23 ss
  WHERE police_force_area NOT IN ('Unknown', 'N/A - vehicle search')
  GROUP BY police_force_area
")
force_df_sum2
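As a sanity check on the SQL above, the searches-per-1,000 formula can be reproduced in plain R. The numbers below are illustrative only: the three-year total comes from the summary table, but the census population is an assumed round figure, not a query against the real tables.

```r
# Rate formula from the query: total searches x 1000 / (years x population).
searches_total <- 23637    # 3-year total for one force (from the summary table)
population     <- 1745000  # assumed 2021-census population for that force
years          <- 3
rate_per_1000  <- (searches_total * 1000) / (years * population)
round(rate_per_1000, 2)    # about 4.52 searches per 1,000 residents per year
```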

force_df_year2 <- dbGetQuery(db, "
  SELECT
    financial_year,
    police_force_area,
    SUM(ss.number_of_searches) AS number_of_searches,
    (SUM(ss.number_of_searches) * 1000.0) /
      (SELECT SUM(REPLACE(p.Population, ',', '') + 0)
       FROM population_by_force_by_ethic_21census p
       WHERE p.`Police.Force.Area` = ss.police_force_area) AS searches_per_1000_population,
    (SUM(ss.number_of_searches) * 100.0) /
      (SELECT SUM(number_of_searches) FROM stop_search_mar21_mar23
       WHERE financial_year = ss.financial_year
         AND police_force_area NOT IN ('Unknown', 'N/A - vehicle search')) AS search_percentage,
    SUM(CASE WHEN link IN ('Linked', 'Not linked', 'Unknown link') THEN 1.0 ELSE 0 END) /
      NULLIF(SUM(ss.number_of_searches), 0) * 100 AS percentage_reasonable
  FROM stop_search_mar21_mar23 ss
  WHERE police_force_area NOT IN ('Unknown', 'N/A - vehicle search')
  GROUP BY financial_year, police_force_area
")
force_df_year2

# Update the 'Name' column
library(dplyr)
library(stringr)

all_kml_data_update <- all_kml_data %>%
  mutate(Name = gsub("-", " ", Name),           # Replace "-" with space
         Name = gsub(" and ", " & ", Name),     # Replace " and " with " & "
         Name = str_to_title(Name),             # Capitalize each word
         # Handle specific cases
         Name = if_else(Name == "City Of London", "London, City of", Name),
         Name = if_else(Name == "Dyfed Powys", "Dyfed-Powys", Name),
         Name = if_else(Name == "Metropolitan", "Metropolitan Police", Name))

all_kml_data_update
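The normalisation chain above can be checked step by step on a single assumed raw name (the exact strings in the KML file may differ):

```r
library(stringr)

x <- "avon-and-somerset"         # assumed raw KML name
x <- gsub("-", " ", x)           # replace "-" with space
x <- gsub(" and ", " & ", x)     # replace " and " with " & "
str_to_title(x)                  # "Avon & Somerset"
```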

#Now the dataframe `all_kml_data_update` is ready to be joined with `force_df_sum2` on `police_force_area == Name`
# Unique names from all_kml_data_update
unique_names <- unique(all_kml_data_update$Name)

# Unique police_force_area from force_df_sum
unique_police_force_areas2 <- unique(force_df_sum2$police_force_area)

# Check unmatched names in both directions
unmatched_names_in_force_df_sum2 <- unique_police_force_areas2[!unique_police_force_areas2 %in% unique_names]
unmatched_names_in_all_kml_data_update <- unique_names[!unique_names %in% unique_police_force_areas2]

# Print unmatched names
print("Unmatched names in force_df_sum2:")
print(unmatched_names_in_force_df_sum2)

print("Unmatched names in all_kml_data_update:")
print(unmatched_names_in_all_kml_data_update)

#The official boundary file does not contain a geometry for the British Transport Police, so I scraped its introduction and coordinates from Wikipedia. It turned out to be a national special police force that polices the railway network of England, Wales and Scotland, so its effective boundary is the whole of Great Britain.


library(dplyr)
library(RSelenium)
library(rvest)
library(netstat)
library(stringr)

driver <- rsDriver(browser = "firefox", port = free_port(random = TRUE), chromever = NULL)
remDr <- driver[["client"]]
url <- "https://en.wikipedia.org/wiki/British_Transport_Police"
remDr$navigate(url)

# The page is static, so it can also be parsed directly with rvest
html <- read_html(url)
Intro_BTP <- html %>% html_elements(css = ".mw-content-ltr > p:nth-child(5)")
Intro_BTP_text <- Intro_BTP %>% html_text()
Intro_BTP_text

# Scrape the Operations jurisdiction entry using CSS selectors
operations_jurisdiction <- html %>% html_elements(".infobox-data")
operations_jurisdiction <- operations_jurisdiction[
  html %>% html_elements(".infobox-label") %>% html_text() == "Operations jurisdiction"
] %>% html_text()
print(operations_jurisdiction)

# Scrape headquarters coordinates using a CSS selector
coordinates <- html %>% html_elements(".geo-default") %>% html_text()
coordinates <- ifelse(length(coordinates) > 0, coordinates[[1]], "N/A")

# Keep the decimal part after the slash and drop the separator characters
coordinates <- sub(".* / ", "", coordinates)
coordinates <- gsub("; ", " ", coordinates)

print(coordinates)
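The two-step clean-up above can be illustrated on an assumed example of the `.geo-default` text that Wikipedia emits (the live value may differ):

```r
raw   <- "51°29′41″N 0°06′20″W / 51.4948; -0.1056"  # assumed scraped string
step1 <- sub(".* / ", "", raw)    # keep the decimal part after the slash
step2 <- gsub("; ", " ", step1)   # drop the "; " separator
step2                             # "51.4948 -0.1056"
```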
# Combine the two data frames with a full join
map_to_plot2<- full_join(force_df_sum2 , all_kml_data_update, by = c("police_force_area" = "Name"))

# After the join, map_to_plot2 includes all rows from force_df_sum2 and matching rows
# from all_kml_data_update, including any with empty geometry.

# Filter out rows with empty geometry
map_to_plot_filtered2 <- map_to_plot2 %>%
  filter(!st_is_empty(geometry))

# This will remove rows where the geometry is empty (including 'MULTIPOLYGON Z EMPTY')

library(sf)

# Check whether map_to_plot_filtered2 is still an sf object
class(map_to_plot_filtered2)

# If it is not, convert it back

if (!"sf" %in% class(map_to_plot_filtered2)) {
  map_to_plot_filtered2 <- st_as_sf(map_to_plot_filtered2)
}

# Check the class again
class(map_to_plot_filtered2)

# Check rows with empty geometry
rows_with_empty_geometry <- map_to_plot_filtered2[which(st_is_empty(st_geometry(map_to_plot_filtered2))), ]

# Count the number of such rows
num_rows_with_empty_geometry <- nrow(rows_with_empty_geometry)

# Print the count
print(paste("Number of rows with 'MULTIPOLYGON Z EMPTY' geometry:", num_rows_with_empty_geometry))

# Optionally, view the rows with empty geometry
if (num_rows_with_empty_geometry > 0) {
  print(rows_with_empty_geometry$police_force_area)
}

library(sf)
library(ggplot2)
library(dplyr)
#install.packages("ggspatial") # run once, outside the knitted document
library(ggspatial)

# Identify the highest and second-highest values and the location of the maximum
highest_value <- max(map_to_plot_filtered2$searches_per_1000_population, na.rm = TRUE)
second_highest_value <- sort(map_to_plot_filtered2$searches_per_1000_population, decreasing = TRUE)[2]
highest_location <- st_centroid(map_to_plot_filtered2[which.max(map_to_plot_filtered2$searches_per_1000_population), ])

# Plot with ggplot
map_rate <- ggplot(map_to_plot_filtered2) +
  geom_sf(aes(fill = searches_per_1000_population)) +
  geom_sf(data = highest_location, color = "red", size = 3, shape = 8) + # Shape 8 is a star
  scale_fill_viridis_c(
    name = "Searches per 1000 Population",
    limits = c(0, second_highest_value),  # Set the limits excluding the highest value
    option = "D"
  ) +
  labs(
    caption = "Note: \n• The red star marks the force area with the highest rate of searches per 1,000 population: City of London, at 295. \n• Grey areas indicate forces whose stop-and-search data are missing from DATA.POLICE.UK due to technical issues"
  ) +
  theme_minimal() +
  theme(legend.position = "right", plot.caption = element_text(hjust = 0)) +
  annotation_scale(location = "bl")

# Print the map
print(map_rate)

library(plotly)
#ggplotly(map_rate) # interactive map to help explore regional differences in detail
library(sf)
library(dplyr)

map <- tm_shape(map_to_plot_filtered) +
  tm_polygons("average_number_of_searches")

# Let's improve the map by creating more visual variation with our bins: 
breaks <- classInt::classIntervals(map_to_plot_filtered$average_number_of_searches, n = 5, style = "jenks", dig.lab=10)
breaks

map_to_plot_filtered <- map_to_plot_filtered |>
  mutate(number_of_searches_cut = cut(average_number_of_searches, breaks$brks, include.lowest = TRUE, dig.lab=10))
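A minimal sketch of the Jenks binning used above, on assumed toy values spanning roughly the same order-of-magnitude range as the force totals:

```r
# classInt::classIntervals finds "natural breaks" (Jenks) that minimise
# within-bin variance; cut() then assigns each value to its bin.
library(classInt)

vals <- c(1800, 2500, 5000, 7900, 11800, 20600, 49200, 235400)  # assumed toy totals
brks <- classIntervals(vals, n = 3, style = "jenks")$brks       # 4 break points for 3 bins
cut(vals, brks, include.lowest = TRUE, dig.lab = 10)
```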

map <- tm_shape(map_to_plot_filtered) +
  tm_polygons("number_of_searches_cut", palette = "Reds", title = "Average Number of Searches") +
  tm_layout(main.title = "UK Police Stop and Search, 2020-2023", legend.outside = TRUE, frame = FALSE)
  
map
tmap_save(map, "UK_stop_and_search.png", dpi = 300)
# Finally, we can create a dynamic "live" map using tmap:
tmap_options(check.and.fix = TRUE) 
tmap_mode("view")
map

tmap_save(map, "UK_stop_and_search.html")
tmap_mode("plot")
##           police_force_area number_of_searches average_number_of_searches
## 1           Avon & Somerset              23637                       7879
## 2              Bedfordshire              11690                       3896
## 3  British Transport Police              35476                      11825
## 4            Cambridgeshire               9082                       3027
## 5                  Cheshire              23771                       7923
## 6                 Cleveland              19696                       6565
## 7                   Cumbria              10528                       3509
## 8                Derbyshire               5992                       1997
## 9          Devon & Cornwall              20603                       6867
## 10                   Dorset               6686                       2228
## 11                   Durham               8002                       2667
## 12              Dyfed-Powys              15209                       5069
## 13                    Essex              61911                      20637
## 14          Gloucestershire               7360                       2453
## 15       Greater Manchester              51693                      17231
## 16                    Gwent               9176                       3058
## 17                Hampshire              32679                      10893
## 18            Hertfordshire              24103                       8034
## 19               Humberside              18730                       6243
## 20                     Kent              39072                      13024
## 21               Lancashire              34623                      11541
## 22           Leicestershire              19180                       6393
## 23             Lincolnshire               9730                       3243
## 24          London, City of               7602                       2534
## 25               Merseyside             147522                      49174
## 26      Metropolitan Police             706344                     235448
## 27                  Norfolk              16694                       5564
## 28              North Wales              14000                       4666
## 29          North Yorkshire               9538                       3179
## 30         Northamptonshire               9862                       3287
## 31              Northumbria              14747                       4915
## 32          Nottinghamshire              14344                       4781
## 33              South Wales              33250                      11083
## 34          South Yorkshire              44434                      14811
## 35            Staffordshire              15278                       5092
## 36                  Suffolk              13606                       4535
## 37                   Surrey              17293                       5764
## 38                   Sussex              20400                       6800
## 39            Thames Valley              47044                      15681
## 40             Warwickshire               5394                       1798
## 41              West Mercia              14621                       4873
## 42            West Midlands              80983                      26994
## 43           West Yorkshire              55603                      18534
## 44                Wiltshire               5699                       1899
##    searches_per_1000_population search_percentage percentage_reasonable
## 1                      4.516284         1.3183765             17.222998
## 2                      5.529257         0.6520210             19.033362
## 3                            NA         1.9787081             19.810576
## 4                      3.384322         0.5065573             21.096675
## 5                      7.233768         1.3258504             13.449161
## 6                     11.530610         1.0985634             10.773761
## 7                      7.020815         0.5872093             11.559650
## 8                      1.891427         0.3342096             20.594126
## 9                      3.840951         1.1491522             13.837791
## 10                     2.858089         0.3729181             20.909363
## 11                     4.234783         0.4463193             13.046738
## 12                     9.826859         0.8482966              8.488395
## 13                    11.093974         3.4531457              9.072701
## 14                     3.803205         0.4105111             21.970109
## 15                     6.008539         2.8832269             11.694040
## 16                     5.204593         0.5118002             14.625109
## 17                     5.451213         1.8227027             13.133817
## 18                     6.701963         1.3443680             14.587396
## 19                     6.671090         1.0446838             12.909770
## 20                     7.017809         2.1792784             14.424652
## 21                     7.537472         1.9311312              9.924039
## 22                     5.698234         1.0697830             15.683003
## 23                     4.221079         0.5427001             15.436793
## 24                   295.200373         0.4240089             24.533018
## 25                    34.549627         8.2281817              3.743848
## 26                    26.782383        39.3970172              3.297827
## 27                     6.074128         0.9311239             12.154067
## 28                     6.793650         0.7808635             11.478571
## 29                     3.885196         0.5319911             14.971692
## 30                     4.186374         0.5500626             18.535794
## 31                     3.395360         0.8225281             10.829321
## 32                     4.163260         0.8000504             17.909927
## 33                     8.415514         1.8545508              9.714286
## 34                    10.771840         2.4783492              9.882072
## 35                     4.488992         0.8521452             15.185234
## 36                     5.962100         0.7588878             15.522564
## 37                     4.791182         0.9645337             14.236974
## 38                     3.986806         1.1378297             17.107843
## 39                     6.234617         2.6239244             18.788793
## 40                     3.012861         0.3008555             27.790137
## 41                     3.749381         0.8155004             19.725053
## 42                     9.245727         4.5169049             11.678994
## 43                     7.881655         3.1013109              8.512131
## 44                     2.554194         0.3178672             17.231093
##     financial_year        police_force_area number_of_searches
## 1          2020/21          Avon & Somerset              10386
## 2          2020/21             Bedfordshire               4028
## 3          2020/21 British Transport Police              12885
## 4          2020/21           Cambridgeshire               3551
## 5          2020/21                 Cheshire               5831
## 6          2020/21                Cleveland               6350
## 7          2020/21                  Cumbria               3594
## 8          2020/21               Derbyshire               2214
## 9          2020/21         Devon & Cornwall               8420
## 10         2020/21                   Dorset               2678
## 11         2020/21                   Durham               2379
## 12         2020/21              Dyfed-Powys               3463
## 13         2020/21                    Essex              26043
## 14         2020/21          Gloucestershire               3202
## 15         2020/21       Greater Manchester              11741
## 16         2020/21                    Gwent               4842
## 17         2020/21                Hampshire              12782
## 18         2020/21            Hertfordshire               9881
## 19         2020/21               Humberside               7424
## 20         2020/21                     Kent              13903
## 21         2020/21               Lancashire              11165
## 22         2020/21           Leicestershire               6932
## 23         2020/21             Lincolnshire               3535
## 24         2020/21          London, City of               2743
## 25         2020/21               Merseyside              48818
## 26         2020/21      Metropolitan Police             316747
## 27         2020/21                  Norfolk               7382
## 28         2020/21              North Wales               6201
## 29         2020/21          North Yorkshire               3797
## 30         2020/21         Northamptonshire               2835
## 31         2020/21              Northumbria               5017
## 32         2020/21          Nottinghamshire               5089
## 33         2020/21              South Wales              16248
## 34         2020/21          South Yorkshire              18841
## 35         2020/21            Staffordshire               5930
## 36         2020/21                  Suffolk               5250
## 37         2020/21                   Surrey               7454
## 38         2020/21                   Sussex               8729
## 39         2020/21            Thames Valley              18982
## 40         2020/21             Warwickshire               2415
## 41         2020/21              West Mercia               6042
## 42         2020/21            West Midlands              25895
## 43         2020/21           West Yorkshire              21047
## 44         2020/21                Wiltshire               2223
## 45         2021/22          Avon & Somerset               6915
## 46         2021/22             Bedfordshire               3671
## 47         2021/22 British Transport Police              10467
## 48         2021/22           Cambridgeshire               2665
## 49         2021/22                 Cheshire               5997
## 50         2021/22                Cleveland               6210
## 51         2021/22                  Cumbria               2277
## 52         2021/22               Derbyshire               1735
## 53         2021/22         Devon & Cornwall               5856
## 54         2021/22                   Dorset               2115
## 55         2021/22                   Durham               2209
## 56         2021/22              Dyfed-Powys               2669
## 57         2021/22                    Essex              18248
## 58         2021/22          Gloucestershire               1946
## 59         2021/22       Greater Manchester               9642
## 60         2021/22                    Gwent               1977
## 61         2021/22                Hampshire               9791
## 62         2021/22            Hertfordshire               7335
## 63         2021/22               Humberside               6083
## 64         2021/22                     Kent              12139
## 65         2021/22               Lancashire              10632
## 66         2021/22           Leicestershire               5633
## 67         2021/22             Lincolnshire               2887
## 68         2021/22          London, City of               2672
## 69         2021/22               Merseyside              46059
## 70         2021/22      Metropolitan Police             212107
## 71         2021/22                  Norfolk               5335
## 72         2021/22              North Wales               4191
## 73         2021/22          North Yorkshire               2518
## 74         2021/22         Northamptonshire               3156
## 75         2021/22              Northumbria               4243
## 76         2021/22          Nottinghamshire               4466
## 77         2021/22              South Wales               9900
## 78         2021/22          South Yorkshire              12854
## 79         2021/22            Staffordshire               4658
## 80         2021/22                  Suffolk               4217
## 81         2021/22                   Surrey               4974
## 82         2021/22                   Sussex               6183
## 83         2021/22            Thames Valley              13843
## 84         2021/22             Warwickshire               1766
## 85         2021/22              West Mercia               4334
## 86         2021/22            West Midlands              26372
## 87         2021/22           West Yorkshire              16329
## 88         2021/22                Wiltshire               1694
## 89         2022/23          Avon & Somerset               6336
## 90         2022/23             Bedfordshire               3991
## 91         2022/23 British Transport Police              12124
## 92         2022/23           Cambridgeshire               2866
## 93         2022/23                 Cheshire              11943
## 94         2022/23                Cleveland               7136
## 95         2022/23                  Cumbria               4657
## 96         2022/23               Derbyshire               2043
## 97         2022/23         Devon & Cornwall               6327
## 98         2022/23                   Dorset               1893
## 99         2022/23                   Durham               3414
## 100        2022/23              Dyfed-Powys               9077
## 101        2022/23                    Essex              17620
## 102        2022/23          Gloucestershire               2212
## 103        2022/23       Greater Manchester              30310
## 104        2022/23                    Gwent               2357
## 105        2022/23                Hampshire              10106
## 106        2022/23            Hertfordshire               6887
## 107        2022/23               Humberside               5223
## 108        2022/23                     Kent              13030
## 109        2022/23               Lancashire              12826
## 110        2022/23           Leicestershire               6615
## 111        2022/23             Lincolnshire               3308
## 112        2022/23          London, City of               2187
## 113        2022/23               Merseyside              52645
## 114        2022/23      Metropolitan Police             177490
## 115        2022/23                  Norfolk               3977
## 116        2022/23              North Wales               3608
## 117        2022/23          North Yorkshire               3223
## 118        2022/23         Northamptonshire               3871
## 119        2022/23              Northumbria               5487
## 120        2022/23          Nottinghamshire               4789
## 121        2022/23              South Wales               7102
## 122        2022/23          South Yorkshire              12739
## 123        2022/23            Staffordshire               4690
## 124        2022/23                  Suffolk               4139
## 125        2022/23                   Surrey               4865
## 126        2022/23                   Sussex               5488
## 127        2022/23            Thames Valley              14219
## 128        2022/23             Warwickshire               1213
## 129        2022/23              West Mercia               4245
## 130        2022/23            West Midlands              28716
## 131        2022/23           West Yorkshire              18227
## 132        2022/23                Wiltshire               1782
##     searches_per_1000_population search_percentage percentage_reasonable
## 1                       5.953309         1.4527622             15.357212
## 2                       5.715615         0.5634244             16.757696
## 3                             NA         1.8023147             17.873496
## 4                       3.969740         0.4967031             18.220220
## 5                       5.323306         0.8156226             11.661808
## 6                      11.152423         0.8882187              9.401575
## 7                       7.190200         0.5027178             13.466889
## 8                       2.096605         0.3096876             20.099368
## 9                       4.709141         1.1777640             13.610451
## 10                      3.434324         0.3745905             19.454817
## 11                      3.777012         0.3327673             13.114754
## 12                      6.712554         0.4843939              9.240543
## 13                     14.000115         3.6428158              7.971432
## 14                      4.963802         0.4478860             18.457214
## 15                      4.094148         1.6422954             14.990205
## 16                      8.239094         0.6772843             12.515489
## 17                      6.396530         1.7879074             13.135660
## 18                      8.242388         1.3821243             12.711264
## 19                      7.932649         1.0384466             11.301185
## 20                      7.491446         1.9447094             11.393224
## 21                      7.291905         1.5617263              8.472906
## 22                      6.178336         0.9696271             15.002885
## 23                      4.600672         0.4944651             14.455446
## 24                    319.547996         0.3836825             23.076923
## 25                     34.299501         6.8285136              3.797780
## 26                     36.030204        44.3056088              2.559772
## 27                      8.057844         1.0325717              9.834733
## 28                      9.027305         0.8673771              9.192066
## 29                      4.639994         0.5311128             12.562549
## 30                      3.610334         0.3965512             20.000000
## 31                      3.465353         0.7017627             10.524218
## 32                      4.431155         0.7118339             15.445078
## 33                     12.337017         2.2727209              7.656327
## 34                     13.702496         2.6354219              7.743750
## 35                      5.227069         0.8294704             14.451939
## 36                      6.901592         0.7343541             13.257143
## 37                      6.195594         1.0426429             12.463107
## 38                      5.117769         1.2209860             15.293848
## 39                      7.546903         2.6551445             20.608998
## 40                      4.046751         0.3378029             21.159420
## 41                      4.648196         0.8451366             17.725919
## 42                      8.869199         3.6221140             10.708631
## 43                      8.950156         2.9439905              6.575759
## 44                      2.988932         0.3109465             14.035088
## 45                      3.963714         1.3023335             17.700651
## 46                      5.209043         0.6913762             19.803868
## 47                            NA         1.9712978             19.422948
## 48                      2.979261         0.5019116             22.326454
## 49                      5.474852         1.1294423             15.591129
## 50                     10.906543         1.1695576             11.095008
## 51                      4.555394         0.4288378             13.746157
## 52                      1.643003         0.3267605             22.247839
## 53                      3.275146         1.1028872             13.439208
## 54                      2.712321         0.3983276             21.229314
## 55                      3.507112         0.4160310             13.942961
## 56                      5.173493         0.5026649             11.165230
## 57                      9.809703         3.4367290              9.135248
## 58                      3.016727         0.3664990             27.286742
## 59                      3.362215         1.8159218             15.764364
## 60                      3.364041         0.3723374             16.034396
## 61                      4.899736         1.8439837             13.277500
## 62                      6.118603         1.3814340             14.587594
## 63                      6.499772         1.1456391             12.181489
## 64                      6.540938         2.2861932             14.622292
## 65                      6.943800         2.0023730             10.205041
## 66                      5.020566         1.0608886             16.882656
##  … [printed output truncated: rows 67–132 of the per-force summary omitted]
## Simple feature collection with 44 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XYZ
## Bounding box:  xmin: -8.624268 ymin: 49.8821 xmax: 1.768976 ymax: 55.81167
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
## First 10 features:
##                Name Description                       geometry
## 1   Avon & Somerset             MULTIPOLYGON Z (((-3.383222...
## 2      Bedfordshire             MULTIPOLYGON Z (((-0.354841...
## 3    Cambridgeshire             MULTIPOLYGON Z (((0.435156 ...
## 4          Cheshire             MULTIPOLYGON Z (((-2.699278...
## 5   London, City of             MULTIPOLYGON Z (((-0.078929...
## 6         Cleveland             MULTIPOLYGON Z (((-1.145833...
## 7           Cumbria             MULTIPOLYGON Z (((-2.981906...
## 8        Derbyshire             MULTIPOLYGON Z (((-1.571625...
## 9  Devon & Cornwall             MULTIPOLYGON Z (((-3.7702 5...
## 10           Dorset             MULTIPOLYGON Z (((-1.900296...
## [1] "Unmatched Names in force_df_sum:"
## [1] "British Transport Police"
## [1] "Unmatched Names in all_kml_data_update:"
## [1] "Northern Ireland"
## [1] "data.frame"
## [1] "sf"         "data.frame"
## [1] "Number of rows with 'MULTIPOLYGON Z EMPTY' geometry: 0"
## Warning: st_centroid assumes attributes are constant over geometries
## st_as_s2(): dropping Z and/or M coordinate
## Scale on map varies by more than 10%, scale bar may be inaccurate
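The warnings above come from computing centroids (presumably for force-name labels) on the XYZ geometries in WGS 84: s2 drops the unused Z coordinate and `st_centroid` assumes attributes are constant over geometries. One way to avoid the Z warning is to drop the all-zero Z dimension explicitly with `st_zm()` before taking centroids; a sketch on a toy polygon:

```r
library(sf)

# Toy XYZ polygon mimicking the MULTIPOLYGON Z geometries read from the KML
pz <- st_polygon(list(cbind(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 0)), 0)))
g  <- st_sfc(pz, crs = 4326)

g2  <- st_zm(g)          # drop the all-zero Z dimension
ctr <- st_centroid(g2)   # centroid for label placement
print(st_coordinates(ctr))
```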

## Warning in
## classInt::classIntervals(map_to_plot_filtered$average_number_of_searches, : var
## has missing values, omitted in finding classes
## style: jenks
##   one of 82,251 possible partitions of this variable into 5 classes
##    [1307,7565]   (7565,17296]  (17296,27260]  (27260,49307] (49307,169396] 
##             30              7              1              1              1
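The breaks above are Jenks natural breaks over the per-force average search counts; 30 of the 40 mapped forces fall in the lowest class, reflecting the long right tail of the distribution. A sketch of how such breaks are computed with classInt, on simulated skewed data rather than the report's values:

```r
library(classInt)

set.seed(42)
# Simulated, right-skewed search counts with one extreme value (illustrative only)
vals <- c(rexp(39, rate = 1 / 8000), 170000)

brks <- classIntervals(vals, n = 5, style = "jenks")
print(brks$brks)                                       # 6 break points, 5 classes
print(table(cut(vals, brks$brks, include.lowest = TRUE)))  # class sizes
```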

## Map saved to /Users/shengyuwen/Downloads/MY472_final_assignment-main/UK_stop_and_search.png
## Resolution: 2160.642 by 2041.06 pixels
## Size: 7.20214 by 6.803533 inches (300 dpi)
## tmap mode set to interactive viewing
## Interactive map saved to /Users/shengyuwen/Downloads/MY472_final_assignment-main/UK_stop_and_search.html
## tmap mode set to plotting
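The messages above indicate a static PNG export at 300 dpi followed by a switch to `tmap_mode("view")` for the interactive HTML export, then back to plotting mode. A minimal sketch of that export pattern (tmap 3.x), using tmap's bundled `World` data and illustrative file names:

```r
library(tmap)

data("World")  # example dataset shipped with tmap
m <- tm_shape(World) + tm_polygons("pop_est")

# Static export at 300 dpi (file names are illustrative)
tmap_save(m, filename = "example_map.png", dpi = 300)

# Interactive export: switch to view mode, save, then switch back
tmap_mode("view")
tmap_save(m, filename = "example_map.html")
tmap_mode("plot")
```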